User-defined functions documentation doesn't tell reader how to write fast functions #14699

itamarst · 2024-02-26T15:51:22Z

Description

https://docs.pola.rs/user-guide/expressions/user-defined-functions/ talks about Python functions, the slowest option.

A later section, NumPy, does talk about using NumPy ufuncs, but the title of the section is "NumPy" so unless you already know that NumPy has this functionality you won't know to look there.

And the fast-and-flexible option of Numba isn't mentioned anywhere.

I therefore propose updating the user-defined functions page as follows:

Add discussion of NumPy.
Add Numba ufuncs.
Add Numba gufuncs for operating on series (possibly in a different section).

This may involve merging or moving some of the NumPy content, not sure yet.

Link

https://docs.pola.rs/user-guide/expressions/user-defined-functions/

itamarst · 2024-02-26T15:51:40Z

I will try to work on this. Depending on how long it gets, this might end up being a series of issues+PRs.

stinodego · 2024-02-27T00:19:37Z

There's a (somewhat) related PR open already: #13392

itamarst · 2024-02-27T14:37:50Z

Thanks! I'll take the comments and info there into account. But at this point I'm contemplating a much more significant rewrite, given these APIs are so tricky to use correctly.

itamarst · 2024-02-28T15:32:57Z

More problems: the documented behavior of map_batches() on this page doesn't match the demonstrated behavior (or perhaps the underlying behavior?). For example, it says:

Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.

Except the actual output in the documentation isn't the wrong results, and group "b" does not in fact have values from group "a"...

MarcoGorelli · 2024-02-28T15:37:28Z

yeah it needs updating since #13181

itamarst · 2024-02-28T15:40:37Z

OK so what are the expected semantics of map_batches()? Is the batch always the original series (in select()) or always the group (in group_by())?

MarcoGorelli · 2024-02-28T15:48:05Z

that looks right

(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ cat t.py
import polars as pl

def func(x):
    print('batch is: ', x)
    return x

df = pl.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

df.select(pl.col('value').map_batches(func))
df.select(pl.col('value').map_batches(func).over('group'))
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ python t.py
batch is:  shape: (3,)
Series: 'value' [i64]
[
        1
        2
        3
]
batch is:  shape: (2,)
Series: '' [i64]
[
        1
        2
]
batch is:  shape: (1,)
Series: '' [i64]
[
        3
]

itamarst · 2024-02-28T16:14:14Z

If that's the case, my first inclination is to not document map_elements() at all in the user guide, nor refer to it in the API docs for map_batches(). The fact that map_elements() sometimes takes a single element and sometimes takes a whole Series seems problematic. Or am I missing something?

cmdlineluser · 2024-02-28T16:23:04Z

Somewhat related:

Documentation for map_batches / map_elements / map_groups / map_rows is confusing #14521

itamarst · 2024-02-28T16:50:53Z

I guess the complexity argument in #14521 suggests one should only use map_batches() for the UDF docs and, as suggested in #14521, document the different APIs in a separate document of their own.

deanm0000 · 2024-03-01T23:27:37Z

If you exclude groups and you have a func that takes a Python scalar and returns a scalar, then

df.select(pl.col('a').map_elements(func))

is mostly the same as

df.select(
    pl.col('a').map_batches(
        lambda x: pl.Series([func(y) for y in x])
    )
)

So it's just a convenience shortcut really.

itamarst added the documentation Improvements or additions to documentation label Feb 26, 2024

itamarst mentioned this issue Feb 28, 2024

Generalized ufunc functions have inconsistent behavior between NumPy and Polars when returning scalars #14748

Closed

2 tasks

itamarst mentioned this issue Mar 20, 2024

docs(python): More accurate and helpful docs for user defined functions #15194

Merged

ritchie46 closed this as completed in #15194 Jun 25, 2024
