Right-aligned numbers in dataframe printout #7378

Closed
2 tasks done
mcrumiller opened this issue Mar 6, 2023 · 12 comments · Fixed by #7475
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars

Comments

@mcrumiller
Contributor

mcrumiller commented Mar 6, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

It's pretty standard in tables to right-align numbers and left-align text. It looks a bit prettier IMO.

Reproducible example

import polars as pl

print(
    pl.DataFrame({
        'a': ['aa', 'b', 'cc', 'd', 'ee', 'f', 'gg', 'h'],
        'b': [1, 31, 2, 4, 5, 66, 99, 103],
    })
)
shape: (8, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ aa  ┆ 1   │
│ b   ┆ 31  │
│ cc  ┆ 2   │
│ d   ┆ 4   │
│ ee  ┆ 5   │
│ f   ┆ 66  │
│ gg  ┆ 99  │
│ h   ┆ 103 │
└─────┴─────┘

Expected behavior

shape: (8, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ aa  ┆   1 │
│ b   ┆  31 │
│ cc  ┆   2 │
│ d   ┆   4 │
│ ee  ┆   5 │
│ f   ┆  66 │
│ gg  ┆  99 │
│ h   ┆ 103 │
└─────┴─────┘
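
A minimal sketch of the requested rule in plain Python, not the Polars implementation: numeric cells are padded on the left and everything else on the right. The helper name `align_cell` and the fixed column width are assumptions made for illustration.

```python
def align_cell(value, width):
    """Right-align numbers, left-align everything else (hypothetical helper)."""
    text = str(value)
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return text.rjust(width) if is_number else text.ljust(width)


rows = [("aa", 1), ("b", 31), ("cc", 2), ("h", 103)]
for a, b in rows:
    # Each row is rendered with the same separators as the tables above.
    print(f"│ {align_cell(a, 3)} ┆ {align_cell(b, 3)} │")
```

Deciding alignment from the value type (rather than from the rendered text) keeps numeric-looking strings such as "007" left-aligned, which matches the "numbers right, text left" convention described above.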

Installed versions

---Version info---
Polars: 0.16.11
Index type: UInt32
Platform: Windows-10-10.0.19044-SP0
Python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.3
numpy: 1.24.2
fsspec: <not installed>
connectorx: 0.3.0
xlsx2csv: 0.8
deltalake: <not installed>
matplotlib: 3.6.1
@mcrumiller mcrumiller added bug Something isn't working python Related to Python Polars labels Mar 6, 2023
@mcrumiller mcrumiller changed the title Numbers should be right-aligned in dataframe printount Right-aligned numbers in dataframe printount Mar 6, 2023
@zundertj zundertj added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Mar 7, 2023
@alicja-januszkiewicz
Contributor

alicja-januszkiewicz commented Mar 10, 2023

I'd also add a Rust tag, since Python's DataFrame.__str__ simply calls Rust's PyDataFrame.as_str, so impl Display for DataFrame would need to be updated.

@alexander-beedie
Collaborator

alexander-beedie commented Mar 13, 2023

I'm not sold on this as the default yet, if only because our float formatting doesn't fix the number of decimals. If we could add that at the same time (e.g. default to 6dp, with an option to modify), it might be more compelling?

Example:

import polars as pl
df = pl.DataFrame({
    "str": ["aaa", "bbb", "ccc"],
    "flt": [1.23456789, 12435.645, 31.999],
})

Currently...

shape: (3, 2)
┌─────┬───────────┐
│ str ┆ flt       │
│ --- ┆ ---       │
│ str ┆ f64       │
╞═════╪═══════════╡
│ aaa ┆ 1.234568  │
│ bbb ┆ 12435.645 │
│ ccc ┆ 31.999    │
└─────┴───────────┘

vs (right aligned)

shape: (3, 2)
┌─────┬───────────┐
│ str ┆       flt │
│ --- ┆       --- │
│ str ┆       f64 │
╞═════╪═══════════╡
│ aaa ┆  1.234568 │
│ bbb ┆ 12435.645 │
│ ccc ┆    31.999 │
└─────┴───────────┘ 

As the decimal point is still all over the place (as dp not constant) we don't actually gain much (any?) readability 🤔


However, with fixed dp precision, such as...

shape: (3, 2)
┌─────┬──────────────┐
│ str ┆          flt │
│ --- ┆          --- │
│ str ┆          f64 │
╞═════╪══════════════╡
│ aaa ┆     1.234568 │
│ bbb ┆ 12435.645000 │
│ ccc ┆    31.999000 │
└─────┴──────────────┘

...or...

shape: (3, 2)
┌─────┬───────────┐
│ str ┆       flt │
│ --- ┆       --- │
│ str ┆       f64 │
╞═════╪═══════════╡
│ aaa ┆     1.235 │
│ bbb ┆ 12435.645 │
│ ccc ┆    31.999 │
└─────┴───────────┘

...I think it would have a lot more value, as the decimal point (and hence the magnitude of the value) is immediately comparable between rows.
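
As a rough illustration of why fixed precision helps, here is a plain-Python sketch (not Polars code) that formats every float with the same number of decimals and then right-aligns the cells, so the decimal point lands in the same position on every row. The function name `format_float_column` is hypothetical.

```python
def format_float_column(values, precision=3):
    """Format floats with a fixed number of decimals, right-aligned to a common width."""
    cells = [f"{v:.{precision}f}" for v in values]
    width = max(len(c) for c in cells)
    return [c.rjust(width) for c in cells]


for cell in format_float_column([1.23456789, 12435.645, 31.999], precision=6):
    print(f"│ {cell} │")
```

Because every cell has the same number of digits after the point, right-aligning the text is enough to align the decimal points; no separate decimal-alignment logic is needed.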

@mcrumiller
Contributor Author

Yeah, I agree for decimals it's weird.

An alternative that may be nonstandard is to align decimals but not display all, e.g.:

shape: (3, 2)
┌─────┬──────────────┐
│ str ┆          flt │
│ --- ┆          --- │
│ str ┆          f64 │
╞═════╪══════════════╡
│ aaa ┆     1.234568 │
│ bbb ┆ 12435.645    │
│ ccc ┆    31.999    │
└─────┴──────────────┘ 

Ok, please don't do that, it's ugly as heck.

@alexander-beedie
Collaborator

alexander-beedie commented Mar 13, 2023

Ok, please don't do that, it's ugly as heck.

Let's say that would be... "novel" :)

@mcrumiller
Contributor Author

I like 3 decimals as default, with the option to increase.

@alicja-januszkiewicz
Contributor

alicja-januszkiewicz commented Mar 13, 2023

While on the topic of float formatting, how should we handle scientific notation? It kinda messes up the alignment:

import polars as pl

pl.Config.set_float_precision(3)
pl.DataFrame({
    'a': [45231.1, 2.22, 99999999.333],
    'b': [4.10, 115.23, 6.3200000004570000024],
    'c': [714.3, 8.424, 9.24222]
})

shape: (3, 3)
┌───────────┬─────────┬─────────┐
│         a ┆       b ┆       c │
│       --- ┆     --- ┆     --- │
│       f64 ┆     f64 ┆     f64 │
╞═══════════╪═════════╪═════════╡
│ 45231.100 ┆   4.100 ┆ 714.300 │
│     2.220 ┆ 115.230 ┆   8.424 │
│   1.000e8 ┆   6.320 ┆   9.242 │
└───────────┴─────────┴─────────┘

@alexander-beedie
Collaborator

alexander-beedie commented Mar 14, 2023

While on the topic of float formatting, how should we handle scientific notation? It kinda messes up the alignment:

This might be helpful on that front?

Also, let's go with 6dp to start with; I remember implementing the same sort of thing years ago at JPMorgan, and the feedback was that 3dp was actually too few for a lot of common cases. Better to offer a fuller picture by default, and have the option to tune it down to people's preferences / use-cases.

Update: actually, as a default, let's not enable fixed-precision decimal places at all; having the option will allow us to experiment with different values first.

@alicja-januszkiewicz
Contributor

alicja-januszkiewicz commented Mar 14, 2023

Sorry, perhaps I wasn't clear; I meant to ask how we would handle the case where we have mixed notations in a single column, as in column a of the following example:

┌─────────┬─────────┬─────────┬─────────┐
│       a ┆       b ┆       c ┆       d │
│     --- ┆     --- ┆     --- ┆     --- │
│     f64 ┆     f64 ┆     f64 ┆     f64 │
╞═════════╪═════════╪═════════╪═════════╡
│  24.010 ┆ 2.401e1 ┆   1.020 ┆ 1.020e0 │
│ 8.252e7 ┆ 8.252e7 ┆  14.500 ┆ 1.450e1 │
│ 342.040 ┆ 3.420e2 ┆ 342.042 ┆ 3.420e2 │
│ 4.295e6 ┆ 4.295e6 ┆   9.420 ┆ 9.420e0 │
│   4.400 ┆ 4.400e0 ┆   4.422 ┆ 4.422e0 │
│ 9.922e7 ┆ 9.922e7 ┆ 122.230 ┆ 1.222e2 │
└─────────┴─────────┴─────────┴─────────┘

These currently can come about when some of the values are above a certain magnitude or length threshold.

I suppose one solution could be to simply always display the floats in scientific notation, no matter their magnitude, as shown in column b. However, this would mean column c would be displayed as column d, which to me feels like a downgrade in terms of readability.

Another approach would be to use the scientific notation on a per column basis when one of the values in that column is over the threshold. The downside is that, for some use cases, in a large enough dataset there is bound to be a single outlier value that would cause the whole column of otherwise small values to be displayed in scientific notation.

Lastly, we could apply this per-column rule while only considering the values currently being printed, rather than the whole column. However, this would mean that printing different df slices could potentially print the same column in different notations, which would be rather unintuitive.

I'm almost tempted not to worry about cases like column a as there is not much point in aligning those values in different notations, as even when aligned they wouldn't be comparable due to their different magnitudes. It just really looks ugly though.
Edit: I suppose their magnitudes would be comparable.
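
To make the per-column option concrete, here is a hedged plain-Python sketch: if any value in the column reaches a magnitude threshold, the whole column is rendered in scientific notation, otherwise in fixed-point. The function name, the `threshold` parameter and its default of 1e6 are assumptions, not Polars settings. Note also that Python writes exponents as `e+07` rather than the `e7` shown in the tables above.

```python
def format_column(values, precision=3, threshold=1e6):
    """Use one notation for the whole column (hypothetical rule and threshold).

    If any value's magnitude reaches `threshold`, every cell is rendered in
    scientific notation; otherwise every cell uses fixed-point notation.
    """
    use_sci = any(abs(v) >= threshold for v in values)
    spec = f".{precision}e" if use_sci else f".{precision}f"
    cells = [format(v, spec) for v in values]
    width = max(len(c) for c in cells)
    return [c.rjust(width) for c in cells]


print(format_column([24.01, 8.252e7, 342.04]))  # one large value: all scientific
print(format_column([1.02, 14.5, 342.042]))     # no large value: all fixed-point
```

Passing only the rows being printed instead of the full column gives the third variant discussed above, with the same caveat: different slices of one column may then come out in different notations.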

@alexander-beedie
Collaborator

I meant to ask how we would handle the case where we have mixed notations in a single column

Good point... I think the best (most straightforward) option for now is probably a, despite aesthetic reservations ;) Another option (which is what I initially thought the SO answer was referring to - oops) is to unpack the value according to the magnitude of eNN, such that 8.252e7 => 82520000.000, ditching scientific notation entirely (ideally not losing any precision that may be 'behind the scenes', as it were).

This would probably net the best consistency, though with the downside that scientific notation is most likely to kick in when the magnitude is really large, which is exactly when you'd appreciate the brevity, hmm. What do you think? Stick with a for now and iterate in a second pass, or unpack so everything lines up? (I agree with your thoughts about c => d).
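
The "unpack" idea can be sketched in plain Python with the standard `decimal` module. This is an illustration only, not how Polars formats floats, and `unpack_scientific` is a hypothetical name.

```python
from decimal import Decimal


def unpack_scientific(text, precision=3):
    """Expand a scientific-notation string into fixed-point with `precision` decimals."""
    # Decimal parses "8.252e7" exactly, so no binary floating-point error is introduced.
    return f"{Decimal(text):.{precision}f}"


print(unpack_scientific("8.252e7"))
```

Working from the decimal text rather than a binary float avoids introducing extra rounding error, although in a real implementation the starting point would be the stored f64 value rather than a string.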

@alicja-januszkiewicz
Contributor

I have implemented a in #7475 for the time being.

With the other option it'd be more of a case of not packing the value in the first place. Some threshold should exist though, as printing 1e30 or 1e60 in normal notation seems like a bad default.

Perhaps the threshold should be its own setting too? Say POLARS_FMT_NUM_LEN, or perhaps we could rename POLARS_FMT_STR_LEN to something like POLARS_FMT_COL_LEN and use that for both?
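
A sketch of how such a setting could behave, in plain Python. `POLARS_FMT_NUM_LEN` is only the name proposed above, not an existing Polars option, and the default of 10 characters as well as both function names are assumptions for illustration.

```python
import os


def num_len_threshold(default=10):
    """Read the (hypothetical) POLARS_FMT_NUM_LEN setting, falling back to `default`."""
    raw = os.environ.get("POLARS_FMT_NUM_LEN")
    return int(raw) if raw is not None else default


def format_number(value, precision=3):
    """Use fixed-point unless the rendered text exceeds the configured length."""
    fixed = f"{value:.{precision}f}"
    if len(fixed) > num_len_threshold():
        # Too long for the column: fall back to scientific notation.
        return f"{value:.{precision}e}"
    return fixed


print(format_number(4.4))   # short enough, stays fixed-point
print(format_number(1e30))  # too long, switches to scientific notation
```

Measuring the rendered length rather than the magnitude means a single limit could, in principle, be shared with string truncation, which is the motivation for the POLARS_FMT_COL_LEN idea.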

The issue raised in that SO thread is also worth implementing, but I just haven't gotten around to that yet :-)

@alexander-beedie
Collaborator

I have implemented a in #7475 for the time being.

Nice; I'll review shortly :)

FYI: I spotted we have fn fmt_float, which seems to address some of these issues; perhaps we can add some extra options there, as needed? Something to look at in a second pass.

@mcrumiller
Contributor Author

I think this fell by the wayside, any chance of reviving?

4 participants