Faceting bug for categorical columns #3588

Open
wirhabenzeit opened this issue Sep 11, 2024 · 29 comments

@wirhabenzeit

What happened?

Faceting by pl.Categorical columns results in wrong facets

import altair as alt
import polars as pl
import vega_datasets

alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")

facets-not-ok

I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)

I checked the Vega-Lite output and I think the issue is the sort parameter of the resulting spec file.

What would you like to happen instead?

The same code with pl.String columns works as expected:

alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin"), 
        pl.col("Cylinders").cast(pl.String)
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")

facets-ok

Which version of Altair are you using?

5.4.1

@dangotbanned
Member

I am not exactly sure what is going wrong, but suddenly all American cars are in the Europe facet, some European cars are in the Japan facet, the Japanese cars are in the correct facet, the 4-Cylinder cars are in the 5 and 6-Cylinder facets, etc. (There is probably some obvious pattern here which I am missing)

I checked the Vega-Lite output and I think the issue is the sort parameter of the resulting spec file.

What would you like to happen instead?

The same code with pl.String columns works as expected

https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html

@wirhabenzeit you'll need to use pl.Categorical("lexical") for this behavior:

import altair as alt
import polars as pl
from vega_datasets import data

df = pl.DataFrame(data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

alt.Chart(df).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders")
Output

image

@dangotbanned dangotbanned removed the bug label Sep 11, 2024
@dangotbanned dangotbanned closed this as not planned Sep 11, 2024
@wirhabenzeit
Author

@dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in wrong facets. If you look at the example above, then the blue points all should be in the USA facet, irrespective of the ordering of the rows.

In fact when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic.

@dangotbanned
Member

@dangotbanned Hmmm I think you misunderstood the issue. The issue is not that the order of facets is not lexicographical. The issue is that for categorical columns the resulting plot simply puts data points in wrong facets. If you look at the example above, then the blue points all should be in the USA facet, irrespective of the ordering of the rows.

In fact when I encountered this issue I used categorical encoding precisely to be able to specify an order, but then the plot just becomes erratic.

@wirhabenzeit Could you explain the difference between these two?

I'm more than happy to reopen the issue if I've misunderstood, but they look the same to me?

What you would like to happen

facets-ok

Output in #3588 (comment)

image

@wirhabenzeit
Author

@dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not lexical results in data points appearing in wrong facets. Above I used the lexical ordering with string-columns only to show the bug. The output I would like is the output which respects the categorical order and does not put points in wrong facets.

@mattijn
Contributor

mattijn commented Sep 11, 2024

Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue!

@dangotbanned dangotbanned reopened this Sep 11, 2024
@wirhabenzeit
Author

@mattijn I have looked around more and I think this goes back to vega/vega-lite#5937
Basically there is a long-standing bug in Vega-Lite with facet-sorting whenever there are missing data points in some of the facets. I did not find it initially because I was focused on pl.Categorical and did not suspect it was a problem with the sorting.

@dangotbanned
Member

@wirhabenzeit

@dangotbanned There is no difference. Maybe I explained it poorly. My bug report is that faceting with categorical columns which are not lexical results in data points appearing in wrong facets. Above I used the lexical ordering with string-columns only to show the bug. The output I would like is the output which respects the categorical order and does not put points in wrong facets.

@mattijn

Thanks for raising this issue @wirhabenzeit! This is a very interesting issue you are raising. I can reproduce the issue you are describing, but I'm not sure exactly what is going on. Will investigate a bit more what changed with the categorical definition. The usage that you describe sounds solid to me. Maybe this is a regression with 5.4? Anyway, it is reproducible! Thanks again for your time to raise this issue!

I'm still unsure how this isn't explained by the nondeterministic ordering in polars, but reopened since @mattijn seems to get it

@joelostblom
Contributor

Might also be related to vega/vega-lite#8675 which was reported in Altair here #3481.

@mattijn
Contributor

mattijn commented Sep 11, 2024

Yeah, the referenced VL issues are relevant here.

But just to be complete, here is what is happening. Altair tries to sort the values in your column in ascending order when the column is of type str on an encoding channel.

So when having this data:

import polars as pl
import altair as alt

df = pl.DataFrame({"value": [2, 5, 3], "month": ["jan", "feb", "mar"]})

And visualising it with the month on the x-axis channel and the values on the color channel using a rect-mark

chart = alt.Chart(df).mark_rect().encode(
    x='month',
    color='value'
)
chart
image

It can be seen that the x-axis is ordered feb, jan, mar, since f comes before j in the alphabet.

By casting the month column in the dataframe to a categorical in order of appearance (the polars default), we get the following:

df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg).mark_rect().encode(
    x='month',
    color='value'
)
chart_catg
image

The x-axis is now ordered by jan, feb, mar, like the order as is defined in the dataframe.

By comparing the Vega-Lite specification of both charts we notice that the categorical column is serialised differently.

Top chart, column is of type str and it becomes:

chart.to_dict()['encoding']['x']
{'field': 'month', 'type': 'nominal'}

With categorical column defined, it becomes:

chart_catg.to_dict()['encoding']['x']
{'field': 'month', 'sort': ['jan', 'feb', 'mar'], 'type': 'ordinal'}

The sort order is serialized from the categorical definition of the month column in the DataFrame.
All good so far!


Sidenote
Observe that the type is also different: ordinal for the dataframe with the month column defined as categorical, and nominal for the month column defined as str.
The effect is that when you use the categorical month column for the color encoding channel, it is treated as an ordered categorical and therefore gets a sequential color scheme, whereas the default for str values provides distinct colors:

alt.vconcat(
    chart.encode(color="month"), 
    chart_catg.encode(color="month")
).resolve_scale(
    color="independent"
)
image

But upon adding encoding channels such as row and column, this logic for sorting categorical columns in the DataFrame breaks the rendering when there are combinations that contain no data.

The following goes well, but the order of the column may be seen as not right.

import polars as pl
import altair as alt

df = pl.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

chart = alt.Chart(df, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart
image

So upon defining the column month as categorical, the order of the months in the column encoding is correctly sorted, but the data within the plots are incorrect.

df_catg = df.with_columns(pl.col("month").cast(pl.Categorical))
chart_catg = alt.Chart(df_catg, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart_catg
image

Leading indeed to data being drawn within the wrong subplot!
So, indeed, be very careful here!

One can use the following workaround when having a polars DataFrame as in OP:

df_complete = (
    df.select(pl.col(["choice", "month"]).unique().implode())
    .explode("choice")
    .explode("month")
    .join(df, how="left", on=["choice", "month"])
)

df_complete_sorted = df_complete.sort(pl.col("month").cast(pl.Enum(["jan", "feb", "mar"])))
df_complete_catg = df_complete_sorted.with_columns(pl.col("month").cast(pl.Categorical))
df_complete_catg
image
chart_complete_catg = alt.Chart(df_complete_catg, height=100, width=100).mark_line().encode(
    x='time',
    y='value',
    color='choice',
    row='choice',
    column='month'
)
chart_complete_catg
image

This workaround basically makes sure that all combinations that are possible to make with the row/column channel encoding, are actually existing in the DataFrame, albeit filled with a null value.
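
The same idea can be sketched without any dataframe library. The helper below is hypothetical (`complete_grid` is not part of Altair or polars) and assumes the data is a list of plain dicts. It appends one placeholder record, with every other field set to None, for each row/column combination that has no data:

```python
from itertools import product


def complete_grid(records, keys):
    """Return records plus one null-filled placeholder per missing key combination."""
    # Distinct values per key, in order of first appearance.
    levels = {k: list(dict.fromkeys(r[k] for r in records)) for k in keys}
    present = {tuple(r[k] for k in keys) for r in records}
    other_fields = [f for f in records[0] if f not in keys]

    completed = list(records)
    for combo in product(*(levels[k] for k in keys)):
        if combo not in present:
            placeholder = dict(zip(keys, combo))
            placeholder.update({f: None for f in other_fields})
            completed.append(placeholder)
    return completed


records = [
    {"time": 0, "value": 0, "choice": "A", "month": "jan"},
    {"time": 0, "value": 0, "choice": "B", "month": "feb"},
    {"time": 0, "value": 0, "choice": "A", "month": "feb"},
    {"time": 0, "value": 0, "choice": "B", "month": "mar"},
]
completed = complete_grid(records, ["choice", "month"])
added = [(r["choice"], r["month"]) for r in completed if r["value"] is None]
print(added)  # [('A', 'mar'), ('B', 'jan')]
```

Passing more keys (for example a color or shape field) extends the grid to those groups as well, at the cost of more placeholder rows.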

@wirhabenzeit
Author

@mattijn Thanks for investigating! As far as I can see the issue arises on a group level, e.g. when grouping by facet, but also specifying encodings such as color or shape, then data gets misplaced as soon as any group (like row a, column b, color c, shape d) contains no data points. Could there be an automatic way of detecting this on the Altair side, and issuing a warning? Probably that’s difficult in case categories are derived using transformations etc?
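
A rough sketch of what such a check could look like on the raw data, assuming records are available as plain dicts. The function names are hypothetical and nothing like this exists in Altair as far as this thread shows. As noted above, it cannot see categories that only appear after transformations inside the chart:

```python
import warnings
from itertools import product


def find_empty_groups(records, keys):
    """Return the key combinations that occur in no record."""
    levels = [list(dict.fromkeys(r[k] for r in records)) for k in keys]
    present = {tuple(r[k] for k in keys) for r in records}
    return [combo for combo in product(*levels) if combo not in present]


def warn_on_empty_groups(records, keys):
    """Emit a UserWarning if any combination of `keys` has no data."""
    missing = find_empty_groups(records, keys)
    if missing:
        warnings.warn(
            f"{len(missing)} combination(s) of {keys} have no data "
            f"(for example {missing[0]}); sorted facets may be drawn incorrectly.",
            UserWarning,
            stacklevel=2,
        )
    return missing


records = [
    {"origin": "USA", "cylinders": "8"},
    {"origin": "USA", "cylinders": "4"},
    {"origin": "Japan", "cylinders": "4"},
]
missing = warn_on_empty_groups(records, ["origin", "cylinders"])
print(missing)  # [('Japan', '8')]
```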

@mattijn
Contributor

mattijn commented Sep 12, 2024

Thanks for your response. You mean you can introduce this behavior without a row/column encoding channel included? Do you have an example of this? That seems more troublesome and would indeed require more feedback to the user: a warning at best, or at least a note in the documentation.

@wirhabenzeit
Author

No, I think without rows/columns the issue is not there. What I meant is that problems arise as soon as some color/shape group has no data points in some facet. So for the workaround one would need to fill in nulls not only for empty facets but also for empty groups within a facet. In my original example above the two plots are not just different in the sense that some entire facets are in the wrong place, but the individual facets are also different. I can try to produce a more minimal example showing this.

@mattijn
Contributor

mattijn commented Sep 12, 2024

There seems to be something going on with polars too. First, if I do

import polars as pl
import vega_datasets
df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

The order is correct in the chart, but when doing:

df['Cylinders'].cat.get_categories().to_list() 

I get

['8', '4', '6', '3', '5']

So it is not really clear to me, how the chart specification can know the right order.

But if I try to force the categorical order using an Enum:

import vega_datasets
import polars as pl

df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list() 
print('cast Enum', sorted(uniq_cylinders))

df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))

df_catg['Cylinders'].cat.get_categories().to_list()

It returns

cast Enum ['3', '4', '5', '6', '8']
['8', '4', '6', '3', '5']

And a wrongly sorted chart.
@dangotbanned, do you know more about this behaviour of polars?

@dangotbanned
Member

dangotbanned commented Sep 12, 2024

And a wrongly sorted chart. @dangotbanned, do you know more about this behaviour of polars?

@mattijn I can help but could you add some comments - explaining the intention behind each action you've taken please?

But if I try to force the categorical order using an Enum:

Code block
import vega_datasets
import polars as pl

df = pl.from_pandas(vega_datasets.data.cars()).with_columns(
    pl.col("Origin"),
    pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
)
uniq_cylinders = df['Cylinders'].unique().to_list() 
print('cast Enum', sorted(uniq_cylinders))

df_sort = df.sort(pl.col('Cylinders').cast(pl.Enum(sorted(uniq_cylinders))))  # ['3', '4', '5', '6', '8']
df_catg = df_sort.with_columns(pl.col('Cylinders').cast(pl.Categorical))

df_catg['Cylinders'].cat.get_categories().to_list()

I'm having trouble understanding as this reads more like pandas than polars code

My immediate thoughts are:

I should have elaborated in #3588 (comment) but to me the issue seems to be wanting some explicit behavior - without using any of the explicit features of polars.

So one way to look at this, is if you tell polars to do something it will try to optimize for the fastest query to get there.
However if you have some constraint that hasn't been defined - then you may be surprised when that gets optimized out.

Maybe this section of their user guide would be helpful?

Also https://docs.pola.rs/user-guide/concepts/data-types/categoricals/

@mattijn
Contributor

mattijn commented Sep 12, 2024

I notice one thing that is different.

If I define the dataframe as you suggested using a lexical option within the pl.Categorical() it is not persisted or included when compiling to Vega-Lite:

df = pl.DataFrame(vega_datasets.data.cars()).with_columns(
    pl.col("Origin", "Cylinders").cast(pl.String).cast(pl.Categorical("lexical"))
)

chart = alt.Chart(df).mark_point().properties(width=100, height=100).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color="Origin",
).facet(row="Origin", column="Cylinders")

print(df.get_column("Cylinders").cat.get_categories())
print(chart.to_dict()['facet'])
shape: (5,)
Series: 'Cylinders' [str]
[
	"8"
	"4"
	"6"
	"3"
	"5"
]
{'column': {'field': 'Cylinders', 'type': 'nominal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}

As you can see, there is no sort defined for the column encoding channel. Therefore the order is correct in this case, but as a false positive.

Where in my ugly (no fun indeed!) defined DataFrame it actually includes the sort for the column encoding channel.

{'column': {'field': 'Cylinders', 'sort': ['8', '4', '6', '3', '5'], 'type': 'ordinal'}, 'row': {'field': 'Origin', 'type': 'nominal'}}
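
To make the comparison concrete, here is a small plain-Python sketch that reports which facet channels carry an explicit sort. The helper name `channels_with_sort` is made up for illustration; the two dicts are the facet definitions printed above:

```python
def channels_with_sort(facet_spec):
    """Map each facet channel that has an explicit `sort` to that sort list."""
    return {
        channel: definition["sort"]
        for channel, definition in facet_spec.items()
        if "sort" in definition
    }


# Facet definition from the pl.Categorical("lexical") DataFrame.
facet_without_sort = {
    "column": {"field": "Cylinders", "type": "nominal"},
    "row": {"field": "Origin", "type": "nominal"},
}
# Facet definition from the DataFrame that ends up with a physical categorical.
facet_with_sort = {
    "column": {"field": "Cylinders", "sort": ["8", "4", "6", "3", "5"], "type": "ordinal"},
    "row": {"field": "Origin", "type": "nominal"},
}

print(channels_with_sort(facet_without_sort))  # {}
print(channels_with_sort(facet_with_sort))     # {'column': ['8', '4', '6', '3', '5']}
```

Only the second specification contains a sort on a facet channel, which is the situation affected by the Vega-Lite issues referenced earlier in this thread.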

Long story short, how does the inference work for a polars column cast as a lexical categorical? Is it correct that there is no sort definition in the corresponding Vega-Lite specification?

@dangotbanned
Member

dangotbanned commented Sep 12, 2024

Long story short, how does the inference work for a polars column cast as a lexical categorical? Is it correct that there is no sort definition in the corresponding Vega-Lite specification?

Thanks @mattijn for the detail!

So this part can be answered (I think) with narwhals.is_ordered_categorical and alt.utils.core.infer_vegalite_type_for_narwhals:

altair/altair/utils/core.py

Lines 712 to 729 in a171ce8

def infer_vegalite_type_for_narwhals(
    column: nw.Series,
) -> InferredVegaLiteType | tuple[InferredVegaLiteType, list]:
    dtype = column.dtype
    if (
        nw.is_ordered_categorical(column)
        and not (categories := column.cat.get_categories()).is_empty()
    ):
        return "ordinal", categories.to_list()
    if dtype in {nw.String, nw.Categorical, nw.Boolean}:
        return "nominal"
    elif dtype.is_numeric():
        return "quantitative"
    elif dtype in {nw.Datetime, nw.Date}:
        return "temporal"
    else:
        msg = f"Unexpected DtypeKind: {dtype}"
        raise ValueError(msg)

From what I'm understanding of https://github.com/narwhals-dev/narwhals/blob/aed2d515a2e26465a6edecf8d7aa560353cbdfa2/narwhals/utils.py#L401-L407

The type will be

  • "ordinal" for pl.Categorical("physical"), pl.Enum
  • "nominal" for pl.Categorical("lexical"), pl.Enum nw.Enum, pl.String

cc @MarcoGorelli to double check

Edit

Misunderstood that nw.Enum != pl.Enum. nw.Enum can represent more than only pl.Enum

narwhals.is_ordered_categorical

For Polars:
Enums are always ordered.
Categoricals are ordered if dtype.ordering == "physical".
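
As a plain-Python model of the rules quoted above (a sketch only: `ColumnInfo`, `is_ordered` and `infer_type` are hypothetical names, not the narwhals or Altair implementation), the decision reduces to:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnInfo:
    """Hypothetical stand-in for a column's dtype information."""
    kind: str                       # "enum", "categorical", "string" or "boolean"
    ordering: Optional[str] = None  # "physical" or "lexical" for categoricals
    categories: tuple = ()


def is_ordered(col):
    # Enums are always ordered; categoricals only with physical ordering.
    if col.kind == "enum":
        return True
    return col.kind == "categorical" and col.ordering == "physical"


def infer_type(col):
    """Return ("ordinal", sort_list) for ordered columns, else "nominal"."""
    if is_ordered(col) and col.categories:
        return "ordinal", list(col.categories)
    if col.kind in {"string", "categorical", "boolean"}:
        return "nominal"
    raise ValueError(f"Unexpected kind: {col.kind}")


months = ("jan", "feb", "mar")
print(infer_type(ColumnInfo("categorical", "physical", months)))  # ('ordinal', ['jan', 'feb', 'mar'])
print(infer_type(ColumnInfo("categorical", "lexical", months)))   # nominal
print(infer_type(ColumnInfo("enum", None, ("b", "a", "c"))))      # ('ordinal', ['b', 'a', 'c'])
print(infer_type(ColumnInfo("string")))                           # nominal
```

Only the "ordinal" branch produces a sort list, which is why a lexical categorical ends up without a sort entry in the specification.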

@mattijn
Contributor

mattijn commented Sep 12, 2024

Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already.
Or can I understand from here that a categorical with dtype.ordering == "lexical" is intentionally not ordered? And therefore casting to pl.Categorical('lexical') correctly does not add a sort argument to the Vega-Lite specification?

@dangotbanned
Member

Thanks for adding more info on the table! But I'm not sure if I can read an answer in this already. Or can I understand from here that a categorical with dtype.ordering == "lexical" is intentionally not ordered? And therefore casting to pl.Categorical('lexical') correctly does not add a sort argument to the Vega-Lite specification?

@mattijn no worries, yeah you've understood that correctly

@MarcoGorelli
Contributor

Thanks for the ping!

Misunderstood that nw.Enum != pl.Enum

I think they should be the same? As in, pl.Enum should be recognised as nw.Enum:

In [21]: nw.from_native(pl.Series(['a', 'b', 'c'], dtype=pl.Enum(['b', 'a', 'c', 'd'])), allow_series=True).dtype == nw.Enum
Out[21]: True

Regarding physical vs lexical, I don't think that get_categories reflects the order - but maybe it should? The difference can be seen if you compare the categories, e.g. in a sort:

In [22]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('lexical')).sort()
Out[22]:
shape: (3,)
Series: '' [cat]
[
        "a"
        "b"
        "c"
]

In [23]: pl.Series(['b', 'a', 'c'], dtype=pl.Categorical('physical')).sort()
Out[23]:
shape: (3,)
Series: '' [cat]
[
        "b"
        "a"
        "c"
]

but they both return the same output for .cat.get_categories(). Do you need the output of .cat.get_categories to reflect the category ordering?
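
A plain-Python model of the difference, assuming the physical code of a category is simply its position in order of first appearance (an assumption for illustration; this is not how polars is implemented, and a pre-populated string cache can change the codes):

```python
values = ["b", "a", "c"]

# Categories in order of first appearance; the list index acts as the physical code.
categories = list(dict.fromkeys(values))
codes = {cat: code for code, cat in enumerate(categories)}

# "physical" ordering sorts by code, "lexical" ordering sorts by the string itself.
physical_sorted = sorted(values, key=codes.__getitem__)
lexical_sorted = sorted(values)

print(categories)       # ['b', 'a', 'c']
print(physical_sorted)  # ['b', 'a', 'c']
print(lexical_sorted)   # ['a', 'b', 'c']
```

The category list is identical in both cases; only the sort key differs, which matches the two outputs above.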


nw.is_ordered_categorical just does what Polars does in its dataframe interchange protocol definition:

https://github.com/pola-rs/polars/blob/501988ea1c2a114e4c28619727157354211af93a/py-polars/polars/interchange/column.py#L60-L78

        if dtype == Categorical:
            categories = self._col.cat.get_categories()
            is_ordered = dtype.ordering == "physical"  # type: ignore[attr-defined]
        elif dtype == Enum:
            categories = dtype.categories  # type: ignore[attr-defined]
            is_ordered = True
        else:
            msg = "`describe_categorical` only works on categorical columns"
            raise TypeError(msg)

the interchange protocol definition is a bit vague here, it just says "whether the ordering of dictionary indices is semantically meaningful"

@dangotbanned
Member

Thanks @MarcoGorelli

Thanks for the ping!

Misunderstood that nw.Enum != pl.Enum

I think they should be the same? As in, pl.Enum should be recognised as nw.Enum:

In [21]: nw.from_native(pl.Series(['a', 'b', 'c'], dtype=pl.Enum(['b', 'a', 'c', 'd'])), allow_series=True).dtype == nw.Enum
Out[21]: True

So I goofed on this one 🤦‍♂️

In #3588 (comment) I was trying to explain this bit where nw.Enum is representing non-polars Enums.

AFAIK pl.Enum wouldn't reach that branch:

altair/altair/utils/core.py

Lines 712 to 722 in a171ce8

def infer_vegalite_type_for_narwhals(
    column: nw.Series,
) -> InferredVegaLiteType | tuple[InferredVegaLiteType, list]:
    dtype = column.dtype
    if (
        nw.is_ordered_categorical(column)
        and not (categories := column.cat.get_categories()).is_empty()
    ):
        return "ordinal", categories.to_list()
    if dtype in {nw.String, nw.Categorical, nw.Boolean}:
        return "nominal"

Maybe I should've written nw.Enum >= pl.Enum - or skipped the operators entirely

@MarcoGorelli

This comment was marked as outdated.

@MarcoGorelli
Contributor

I think the issue reproduces with pandas ordered categoricals too, both on Altair 5.3.0 and Altair 5.4.1

image

code:

import altair as alt
import pandas as pd

df_catg2 = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

df_catg2["month"] = df_catg2["month"].astype(pd.CategoricalDtype(ordered=True))
chart_catg2 = (
    alt.Chart(df_catg2, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_catg2

@mattijn
Contributor

mattijn commented Sep 12, 2024

Pff, complicated. Altair assumes that the returned categories are in sorted order when it is defined as ordered, but this is an assumption that does not always hold.

  • Custom order with Enums is going OK in Altair.
    The column is seen as an ordered categorical. get_categories() returns the custom order as is defined. Result is that the custom order list from get_categories() is used to sort the y-encoding channel in the following chart.
import altair as alt
import narwhals as nw
import polars as pl

my_order = ["k", "z", "b", "a"]

df = pl.from_dict({"cats": ['z', 'z', 'k', 'a', 'b'], "vals": [3, 1, 2, 2, 3]})
df = df.with_columns(pl.col("cats").cast(pl.Enum(my_order)))

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title='pl.Enum(my_order)').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
image
  • Physical categorical is going OK in Altair.
    The column is seen as an ordered categorical. get_categories() returns physical categorical sorted in physical order. Result is that the physical ordered list from get_categories() is used to sort the y-encoding channel in the following chart.
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical()))  # 'physical'

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title='pl.Categorical()').mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
image
  • Lexical categorical is ignored in Altair.
    The column is not seen as an ordered categorical. get_categories() does not return the lexical categorical in lexical order. As a result, the list from get_categories() is not used for the y-encoding channel. Since there is no sort defined, Vega applies ascending sorting, which makes it look as if the lexical categorical has an effect.
df = pl.from_dict({"cats": ['12', '4', '2'], "vals": [3, 1, 2]})
df = df.with_columns(pl.col("cats").cast(pl.Categorical('lexical')))

nw_s = nw.from_native(df.get_column("cats"), allow_series=True)
print('ordered according to narwhals:', nw.is_ordered_categorical(nw_s))
print('sort order of get categories:', df.get_column("cats").cat.get_categories().to_list())

chart = alt.Chart(df, title="pl.Categorical('lexical')").mark_bar().encode(
    x='vals',
    y='cats'
)
print('y-encoding sort definition:', chart.to_dict()['encoding']['y'])
chart
image

To support lexical categorical, it should

  1. Be considered as ordered by narwhals.
  2. The sort order of the get_categories() should be reflecting the lexical order.

The current implementation of nw.is_ordered_categorical only allows the order to be defined based on numeric values and not on the alphabet (lexical).

Apparently the situation is different for pandas ordered categoricals, since they do not always return their categories in physical order.


Btw. When trying OP as you did in #3588 (comment). I get this:

import polars as pl
import vega_datasets
import altair as alt
alt.Chart(
    pl.from_pandas(vega_datasets.data.cars()).with_columns(
        pl.col("Origin").cast(pl.Categorical),
        pl.col("Cylinders").cast(pl.String).cast(pl.Categorical),
    )
).mark_point().properties(width=150, height=150).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    shape="Cylinders",
    color=alt.Color("Origin").scale(scheme="category10"),
).facet(row="Origin", column="Cylinders").to_dict()['facet']
{'column': {'field': 'Cylinders',
  'sort': ['8', '4', '6', '3', '5'],
  'type': 'ordinal'},
 'row': {'field': 'Origin',
  'sort': ['USA', 'Europe', 'Japan'],
  'type': 'ordinal'}}

With ('5.5.0dev', '1.6.2') for alt.__version__, nw.__version__. Reflecting the behaviour you have when using Altair version 5.3.0...

@MarcoGorelli
Contributor

Reflecting the behaviour you have when using Altair version 5.3.0...

Right, sorry about that, I just did uv cache clean, reinstalled everything, and indeed I can reproduce the original post - I've marked my previous comment as outdated

Physical categorical is going OK in Altair.

Are you sure about this? It seems to me that anything which is auto-inferred to be "ordinal" (as opposed to "nominal") is subject to issues

For example, if we start with

import altair as alt
import pandas as pd
import polars as pl

df_cat = pd.DataFrame(
    {
        "time": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0, 5, 0, 5, 0, 5, 0, 5],
        "choice": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "month": ["jan", "jan", "feb", "feb", "feb", "feb", "mar", "mar"],
    }
)

then:

pandas ordered categorical: 'ordinal', incorrect data

df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=True))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

pandas unordered categorical: 'nominal', correct data (but wrong ordering)

df_cat["month"] = df_cat["month"].astype(pd.CategoricalDtype(ordered=False))
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

Polars physical categorical: 'ordinal', incorrect data

df_cat = pl.from_pandas(df_cat).with_columns(
    pl.col('month').cast(pl.Categorical('physical'))
)
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

Polars lexical categorical: 'nominal', correct data (but wrong ordering)

df_cat = pl.from_pandas(df_cat).with_columns(
    pl.col('month').cast(pl.Categorical('lexical'))
)
chart_cat = (
    alt.Chart(df_cat, height=100, width=100)
    .mark_line()
    .encode(x="time", y="value", color="choice", row="choice", column="month")
)
chart_cat

To support lexical categorical, it should

  1. Be considered as ordered by narwhals.
  2. The sort order of the get_categories() should be reflecting the lexical order.

I've tried doing this, but then the output from the example above becomes incorrect for both physical and lexical

@mattijn
Contributor

mattijn commented Sep 13, 2024

Not sure about anything anymore, but I think we have identified at least four issues/anomalies by now:

  1. When using a row/column encoding channel in combination with a sort parameter, your data is placed in incorrect subplots if some of the panels contain no data. Related Vega-Lite issue: "Row and column sorting do not work" (vega/vega-lite#5937)
  2. A pandas ordered categorical returns its categories lexical sorted, not physical sorted
pd.Series(['4', '2', '12'], dtype=pd.CategoricalDtype(ordered=True)).cat.categories.to_list()
['12', '2', '4']  # physical sorted is ['4', '2', '12']
  3. A polars lexical categorical returns its categories physical sorted, not lexical sorted
pl.Series(['4', '2', '12']).cast(pl.Categorical('lexical')).cat.get_categories().to_list()
['4', '2', '12']  # lexical sorted is ['12', '2', '4']
  4. A pre-cached lexical sorted categorical remains lexical sorted upon casting to physical categorical in polars (pl_s1)
s1 = pd.Series(['4', '2', '12'], dtype='category')
s2 = pd.Series(['4', '2', '12'])

pl_s1 = pl.from_pandas(s1).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s2 = pl.from_pandas(s2).cast(pl.Categorical('physical')).cat.get_categories().to_list()
pl_s1, pl_s2
(['12', '2', '4'], ['4', '2', '12'])

For clarity, data without a defined categorical returns its categories in physical order when cast to a physical categorical in polars (pl_s2)


Regarding my comment, a few clarification notes in [italic]:

To support [inference of columns with its type casted as] lexical categorical, [the column] should

  1. Be considered as ordered by narwhals.
  2. The sort order of the get_categories() [of this column] should be reflecting the lexical order.

So basically, for proper dataframe inference of ordered categoricals then:

  • A lexical categorical should return its categories sorted in lexical order.
  • A physical categorical should return its categories sorted in physical order.

This also means that data will currently be placed in the wrong subplots if there are panels without data, for both lexical and physical ordered categoricals, since a sort will be defined for the row / column encodings, as described in point 1 in this comment.

@dangotbanned
Member

@joelostblom
Contributor

Forgive me if there is something I am misunderstanding, but it seems like all the issues reported here could stem from VegaLite not handling the sort keyword correctly when faceting into rows and columns as per vega/vega-lite#5937. I think it is difficult to properly troubleshoot what is happening with the categorical field sorting in row and column facets until this is fixed in VegaLite.

Outside row and column faceting, all the scenarios with pd and pl categories work as expected as far as I can see (i.e. the order of the color scale match the order of the categories in each of these examples):

pd ordered

Identified as ordinal as expected:

image

Changing the categorical order changes the color scale order:

image

pd unordered

Identified as nominal as expected:

image

pl physical

Identified as ordinal as expected:

image

pl lexical

As already pointed out above, it seems that identification of lexical categories as ordinal data is not yet supported since it is not indicated as categorical data by narwhals and thus we get back unsorted nominal data:

image

Which would be the same as if we used the pl physical data frame and explicitly encoded the data type as nominal:

image

@c-peters

c-peters commented Oct 1, 2024

I'm not too familiar with what happens in Altair / Narwhals, but indeed the call to get_categories does not return the categories in sorted order.

Would it be possible to call .sort() for the lexical ordered case: pl.col("category_column").cat.get_categories().sort()
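
In plain Python the suggestion amounts to something like the following sketch. The helper `sort_list_for_spec` and its `ordering` argument are hypothetical, and the list stands in for the output of get_categories():

```python
def sort_list_for_spec(categories, ordering):
    """Return the category list to use as an explicit sort for the given ordering."""
    if ordering == "lexical":
        # Lexical ordering: sort the category strings themselves.
        return sorted(categories)
    # Physical ordering: keep the order in which the categories are reported.
    return list(categories)


reported = ["8", "4", "6", "3", "5"]  # category order reported earlier in this thread
print(sort_list_for_spec(reported, "lexical"))   # ['3', '4', '5', '6', '8']
print(sort_list_for_spec(reported, "physical"))  # ['8', '4', '6', '3', '5']
```

Note that this only changes which sort list would be emitted. A sort on a row or column facet would still be subject to the Vega-Lite issue discussed above when some combinations have no data.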
