Improvement for describe #8093

Closed
stevenlis opened this issue Apr 9, 2023 · 9 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@stevenlis

stevenlis commented Apr 9, 2023

Problem description

import polars as pl

df = pl.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
     "B": [1.4, 2.0, 3.3, 4.6, 6.0, 8.0]}
)
  1. Series.describe() does not report the median; only DataFrame.describe() does.
df.describe()

shape: (7, 3)
┌────────────┬──────────┬──────────┐
│ describe   ┆ A        ┆ B        │
│ ---        ┆ ---      ┆ ---      │
│ str        ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╡
│ count      ┆ 6.0      ┆ 6.0      │
│ null_count ┆ 0.0      ┆ 0.0      │
│ mean       ┆ 4.0      ┆ 4.216667 │
│ std        ┆ 2.607681 ┆ 2.503131 │
│ min        ┆ 1.0      ┆ 1.4      │
│ max        ┆ 8.0      ┆ 8.0      │
│ median     ┆ 3.5      ┆ 3.95     │
└────────────┴──────────┴──────────┘
df['A'].describe()

shape: (6, 2)
┌────────────┬──────────┐
│ statistic  ┆ value    │
│ ---        ┆ ---      │
│ str        ┆ f64      │
╞════════════╪══════════╡
│ min        ┆ 1.0      │
│ max        ┆ 8.0      │
│ null_count ┆ 0.0      │
│ mean       ┆ 4.0      │
│ std        ┆ 2.607681 │
│ count      ┆ 6.0      │
└────────────┴──────────┘
  2. Add a precision/format/decimals param to describe. Because describe returns a DataFrame, which has no .round method, and its first column is a string, it probably makes sense to add a param to the function instead.

  3. Add an orientation param so that each stat becomes a column (as you usually see in journal papers). When you have many variables to describe, it becomes hard to scroll horizontally and read the results, and the current workaround with transpose doesn't look very nice. With each stat as a column, you can also format them easily, such as converting count to an int instead of a float. (This could perhaps be solved by first_column_as_header in transpose #8095.)

df.describe().transpose(include_header=True)
shape: (3, 8)
┌──────────┬──────────┬────────────┬─────────────┬─────────────┬──────────┬──────────┬─────────────┐
│ column   ┆ column_0 ┆ column_1   ┆ column_2    ┆ column_3    ┆ column_4 ┆ column_5 ┆ column_6    │
│ ---      ┆ ---      ┆ ---        ┆ ---         ┆ ---         ┆ ---      ┆ ---      ┆ ---         │
│ str      ┆ str      ┆ str        ┆ str         ┆ str         ┆ str      ┆ str      ┆ str         │
╞══════════╪══════════╪════════════╪═════════════╪═════════════╪══════════╪══════════╪═════════════╡
│ describe ┆ count    ┆ null_count ┆ mean        ┆ std         ┆ min      ┆ max      ┆ median      │
│ A        ┆ 6.0      ┆ 0.0        ┆ 4.0         ┆ 2.607680962 ┆ 1.0      ┆ 8.0      ┆ 3.5         │
│          ┆          ┆            ┆             ┆ 0810595     ┆          ┆          ┆             │
│ B        ┆ 6.0      ┆ 0.0        ┆ 4.216666666 ┆ 2.503131372 ┆ 1.4      ┆ 8.0      ┆ 3.949999999 │
│          ┆          ┆            ┆ 6666662     ┆ 349187      ┆          ┆          ┆ 9999997     │
└──────────┴──────────┴────────────┴─────────────┴─────────────┴──────────┴──────────┴─────────────┘
  4. Add the 25th and 75th percentiles, skewness, and kurtosis to the results. These stats are important and handy for seeing at a glance whether your data distribution is normal or skewed. If this is too much info for normal users, maybe add a param for a five-number summary.
@stevenlis stevenlis added the enhancement New feature or an improvement of an existing feature label Apr 9, 2023
@zundertj
Collaborator

zundertj commented Apr 10, 2023

There is no median in Series.describe() but only DataFrame.

I will raise a PR to fix this. Also noticed the ordering is not consistent between the two.

Add a precision/format/decimals param in describe.

This is a question about formatting dataframes in the output in the terminal, not specific to describe. Please open a separate issue.

Add a orientation param so that each stats becomes a column

This will be a challenge, as it means that different data types all need to be represented as the same data type. Note that describe returns a DataFrame, it does not print to the terminal by itself, so we are limited to constructing a dataframe first. We could cast everything to string potentially, but when you see that the "min" of a Float64 column is a string value, that could confuse users? Need to think about this. Have you tried glimpse btw? Not stats, more like head, but does give you the sort of layout you may want.

Add 25th and 75th percentiles, skewness, kurtosis into the results.

I appreciate the suggestion, but incorporating nice-to-haves without considering downsides is not the way to go for a library. Just because you think these specific stats are important, does not mean everyone thinks that way. Another person may think those are useless stats (for his or her use case) and has four other stats that are "very important". Yet another person comes along and complains the output is too complicated, and yet another person complains that on his or her very large dataframe the stats take too long to compute. Adding flags to turn things on leads to a debate about default values, more complexity, etc, etc. I.e. you get the point. Not saying we should never add or change statistics, but I would prefer to do this through a more thorough write up discussing both pros and cons. Feel free to make such a proposal.

In the meantime, I think you are best off writing your own describe function. I feel this is one of those methods where everyone prefers to see different things, which tells us that it is best left to the user. The existing describe imo strikes the right balance for the "average use case" to get going. As you say, your proposal may not be for every user. Writing your own version is very easy thanks to Polars' expression syntax; describe is not much more than a tiny wrapper around those expressions. Hence my point about this being "nice-to-have": it is perfectly possible, and I would say easy, to create your own version in the way you would want it to be.

@alexander-beedie
Collaborator

FYI: as for (2), this PR (in progress) would allow the number of displayed decimal places to be set frame-wide: #7475

@stevenlis
Author

@zundertj Thanks for the suggestions.

This is a question about formatting dataframes in the output in the terminal, not specific to describe. Please open a separate issue.

I'm not quite sure how this is related to dataframes formatting in general. Do you mean there should be a round method for dataframe, which automatically excludes string variables? I remember @ritchie46 mentioned that he doesn't wanna add too many methods for dataframe.

We could cast everything to string potentially, but when you see that the "min" of a Float64 column is a string value, that could confuse users? Need to think about this. Have you tried glimpse btw? Not stats, more like head, but does give you the sort of layout you may want.

If you have a string variable, it will become a string column indeed. I think it's fine and at least you can format some of the columns like count given each stats is a column now instead of a row. I saw someone mentioned glimpse here: #8095

Just because you think these specific stats are important, does not mean everyone thinks that way. Another person may think those are useless stats (for his or her use case) and has four other stats that are "very important". Yet another person comes along and complains the output is too complicated, and yet another person complains that on his or her very large dataframe the stats take too long to compute. Adding flags to turn things on leads to a debate about default values, more complexity, etc, etc. I.e. you get the point. Not saying we should never add or change statistics, but I would prefer to do this through a more thorough write up discussing both pros and cons. Feel free to make such a proposal.

I'm not quite sure how I should interpret your arguments here; I believe these are generic points that could be used against any idea. I haven't seen any survey data on what kind of folks are using Polars, or on which use cases would count as an "average use case" and which would not. One can literally argue std is not necessary in the describe output. I don't know how to argue the importance of checking the skewness of your data. There are tons of articles out there. But I think a five-number summary is the bare minimum we should have. Anyway, I get the idea that you don't like it. I respect your opinion. Thanks for the PR to add the median.

@stevenlis
Author

I found another issue: for a Boolean column where every value is None, the mean is NaN instead of null as it is for the other dtypes.

import polars as pl

pl.DataFrame(
    {'a': [None, None, None],
     'b': [None, None, None],
     'c': [None, None, None],
     'd': [None, None, None]},
    schema={'a': pl.Float32,
            'b': pl.Int64,
            'c': pl.Utf8,
            'd': pl.Boolean}
).describe()


shape: (7, 5)
┌────────────┬──────┬──────┬──────┬──────┐
│   describe ┆    a ┆    b ┆    c ┆    d │
│        --- ┆  --- ┆  --- ┆  --- ┆  --- │
│        str ┆  f64 ┆  f64 ┆  str ┆  f64 │
╞════════════╪══════╪══════╪══════╪══════╡
│      count ┆  3.0 ┆  3.0 ┆    3 ┆  3.0 │
│ null_count ┆  3.0 ┆  3.0 ┆    3 ┆  3.0 │
│       mean ┆ null ┆ null ┆ null ┆  NaN │
│        std ┆ null ┆ null ┆ null ┆ null │
│        min ┆ null ┆ null ┆ null ┆ null │
│        max ┆ null ┆ null ┆ null ┆ null │
│     median ┆ null ┆ null ┆ null ┆ null │
└────────────┴──────┴──────┴──────┴──────┘

@stevenlis
Author

I also found that NaN is not dropped the way null is, so both mean and std come back as NaN if a column contains a NaN.

import polars as pl

pl.DataFrame(
    {'a': [None, float('nan'), 1, 2, 4, 5],
     'b': [None, None, 1, 2, 4, 5],
     'c': [float('nan'), float('nan'), 1, 2, 4, 5]}
).describe()

shape: (7, 4)
┌────────────┬─────┬──────────┬─────┐
│ describe   ┆ a   ┆ b        ┆ c   │
│ ---        ┆ --- ┆ ---      ┆ --- │
│ str        ┆ f64 ┆ f64      ┆ f64 │
╞════════════╪═════╪══════════╪═════╡
│ count      ┆ 6.0 ┆ 6.0      ┆ 6.0 │
│ null_count ┆ 1.0 ┆ 2.0      ┆ 0.0 │
│ mean       ┆ NaN ┆ 3.0      ┆ NaN │
│ std        ┆ NaN ┆ 1.825742 ┆ NaN │
│ min        ┆ 1.0 ┆ 1.0      ┆ 1.0 │
│ max        ┆ 5.0 ┆ 5.0      ┆ 5.0 │
│ median     ┆ 4.0 ┆ 3.0      ┆ 4.5 │
└────────────┴─────┴──────────┴─────┘

@zundertj
Collaborator

This is a question about formatting dataframes in the output in the terminal, not specific to describe. Please open a separate issue.

I'm not quite sure how this is related to dataframes formatting in general. Do you mean there should be a round method for dataframe, which automatically excludes string variables? I remember @ritchie46 mentioned that he doesn't wanna add too many methods for dataframe.

DataFrame.describe() returns a dataframe. You are asking for

a precision/format/decimals param

So this is about the formatting of the outputted dataframe, right? I.e. I can see that per your example it is annoying that for some stats there are many decimals, cluttering the result. But as describe returns a dataframe, and this is about reading the result from the terminal, this is about formatting of dataframe display in the terminal, and therefore not describe specific?

On the final discussion point: I think everyone agrees that skewness can be important, that std can be highly misleading with outliers, etc etc. No need for references here. My point is we should be thoughtful about adding new features, tucked away behind a flag or not, to Polars, because there are more users than the three of us in this particular thread. I'm trying to make sure everyone's voice, even if not present here, is being taken into account. That includes beginners who might be overwhelmed if we add many of these sort of flags to the api, or users who have different sorts of metrics they want to look at.

Let's try to be constructive on this. I'll start: Pandas (and yes, we should not just copy Pandas, I have made that point myself often enough ...) has for numeric data: count, mean, std, min / 25 / 50 / 75 / max. So we have in addition null_count (because Polars count means number of entries, null or non-null), and we are missing 25% and 75%. We could add 25% and 75% probably at almost zero runtime expense once the data is sorted I believe. So the pros would be: we are in line with Pandas + we report more and the drawback is that we have two more outputs + only valid for numeric data.

@stevenlis
Author

@zundertj Thanks for getting back on this. I do think a precision/format/decimals param should affect only the describe output dataframe. But I guess a global setting for float formatting, as @alexander-beedie mentioned, should work fine. I will try it out once the PR is merged.

I agree with everything you said. I think at least 25% and 75% should be added; together with the mean and median, they give you a sense of the skewness of your data. I just found that Polars already has an expr.skew...

Btw, for some reason, the current output is also not sorted... the median comes after max... instead of min, median, max...which looks a bit odd to me.

I'd also like your opinion on the NaN behavior I posted above.

@alexander-beedie
Collaborator

alexander-beedie commented Apr 12, 2023

FYI: seems the percentiles were already available on the Rust-side equivalent, and just got added to Python: #8169. I'm going to fix-up the typing on that commit (bit too restrictive with list) and also allow no percentiles to be returned.

@miroslaavi

Thanks for adding the 25% and 75% to the describe method, very useful.

Any thoughts on adding a unique count as well? It is quite nice for categorical columns.

4 participants