Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User-defined functions documentation doesn't tell reader how to write fast functions #14699

Closed
itamarst opened this issue Feb 26, 2024 · 11 comments · Fixed by #15194
Closed

User-defined functions documentation doesn't tell reader how to write fast functions #14699

itamarst opened this issue Feb 26, 2024 · 11 comments · Fixed by #15194
Labels
documentation Improvements or additions to documentation

Comments

@itamarst
Copy link
Contributor

Description

https://docs.pola.rs/user-guide/expressions/user-defined-functions/ talks about Python functions, the slowest option.

A later section, NumPy, does talk about using NumPy ufuncs, but the title of the section is "NumPy" so unless you already know that NumPy has this functionality you won't know to look there.

And fast-and-flexible option of Numba isn't mentioned anywhere.

I therefore propose updating the user-defined functions page as follows:

  1. Add discussion of NumPy.
  2. Add Numba unfuncs.
  3. Add Numba gufuncs for operating on series (possibly in a different section).

This may involve merging or moving some of the NumPy content, not sure yet.

Link

https://docs.pola.rs/user-guide/expressions/user-defined-functions/

@itamarst itamarst added the documentation Improvements or additions to documentation label Feb 26, 2024
@itamarst
Copy link
Contributor Author

I will try to work on this. Depending how long it gets this might end up being a series of issues+PRs.

@stinodego
Copy link
Member

stinodego commented Feb 27, 2024

There's a (somewhat) related PR open already: #13392

@itamarst
Copy link
Contributor Author

Thanks! I'll take the comments and info there into account. But at this point I'm contemplating a much more significant rewrite, given these APIs are so tricky to use correctly.

@itamarst
Copy link
Contributor Author

itamarst commented Feb 28, 2024

More problems: the documented behavior of map_batches() on this page doesn't match the demonstrated behavior (or perhaps the underlying behavior?). For example, it says:

Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.

Except the actual output in the documentation isn't the wrong results, and group "b" does not in fact have values from group "a"...

@MarcoGorelli
Copy link
Collaborator

yeah it needs updating since #13181

@itamarst
Copy link
Contributor Author

itamarst commented Feb 28, 2024

OK so what are the expected semantics of map_batches()? Is the batch is always the original series (in select()) or always the group (in group_by())?

@MarcoGorelli
Copy link
Collaborator

that looks right

(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ cat t.py
import polars as pl

def func(x):
    print('batch is: ', x)
    return x

df = pl.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

df.select(pl.col('value').map_batches(func))
df.select(pl.col('value').map_batches(func).over('group'))
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/scratch$ python t.py
batch is:  shape: (3,)
Series: 'value' [i64]
[
        1
        2
        3
]
batch is:  shape: (2,)
Series: '' [i64]
[
        1
        2
]
batch is:  shape: (1,)
Series: '' [i64]
[
        3
]

@itamarst
Copy link
Contributor Author

itamarst commented Feb 28, 2024

If that's the case, my first inclination is to not document map_elements() at all in the user guide, nor refer to it in API docs for map_batches()? Since the fact map_elements() sometimes takes a single element and sometimes takes a whole Series seems problematic. Or am I missing something?

@cmdlineluser
Copy link
Contributor

@itamarst
Copy link
Contributor Author

I guess the complexity argument in #14521 suggests one should only use map_batches() for the UDF docs, and as suggested in #14521 document the different APIs in its whole own document.

@deanm0000
Copy link
Collaborator

If you exclude groups and you have func which takes a python scaler and returns a scaler then

then

df.select(pl.col('a').map_elements(func))

Is mostly there same as

df.select(pl.col('a')
.map_batches(lambda x : (
pl.Series([func(y) for y in x])
))
)

So it's just a convenience shortcut really.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants