docs(python): Expand NumPy section in the user guide with ufunc info #13392

Closed
deanm0000 wants to merge 10 commits

deanm0000
Collaborator

scipy.special is (almost) all ufuncs so I provided a link there.

numba creates ufuncs so I provided a link to its page and an example thereof.

Lastly, I added a note about avoiding map_batches. For the longest time I thought avoiding map_batches was a way to improve performance, until I stumbled onto this, which shows it just calls map_batches anyway.

@github-actions github-actions bot added documentation Improvements or additions to documentation python Related to Python Polars labels Jan 2, 2024

Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs)
for a list on all supported numpy functions.
for a list on all supported numpy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of what is available under stats.
Collaborator

so glad you've opened a PR about this 🙌

I think the lint failure is just because you have a trailing space at the end of this line

Collaborator Author

I guess I'll fix this when numba can support the latest python version as that failed too.

@deanm0000
Collaborator Author

looks like numba doesn't work on python 3.12 yet so my numba example won't work unless the example can be run on a lower version of python.

@MarcoGorelli
Collaborator

looks like it's coming soon - tbh I'd suggest just waiting for numba/numba#9197

@deanm0000
Collaborator Author

Yeah I saw that and back burnered it unless/until something happens

@stinodego
Member

For examples that can't run during docs building (e.g. downloading stuff from the Cloud), you can enclose them in triple quotes. See this file for an example:
https://github.com/pola-rs/polars/blob/main/docs/src/python/user-guide/lazy/execution.py

@henryharbeck
Contributor

Hi @deanm0000 and @MarcoGorelli
Numba now supports python 3.12 since their 0.59.0 release.
Release notes are here.
It seems like there are still a few post-release tasks to go, such as creating a GitHub release (ref: numba/numba#9410), but I believe that this PR would now be unblocked.

@deanm0000
Collaborator Author

deanm0000 commented Feb 6, 2024

Hey thanks for the heads up. I'm working on another ufunc change so I'll probably sit on this unless/until that gets merged so that I don't duplicate too much work.

Here's the aforementioned PR #14328

@stinodego stinodego changed the title docs(python): add ufunc info docs(python): Expand numpy section in the user guide Feb 15, 2024
@stinodego stinodego changed the title docs(python): Expand numpy section in the user guide docs(python): Expand numpy section in the user guide with ufunc info Feb 15, 2024
@stinodego stinodego changed the title docs(python): Expand numpy section in the user guide with ufunc info docs(python): Expand NumPy section in the user guide with ufunc info Feb 15, 2024
Comment on lines +1 to +17
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [10, 9, 8, 7]})


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.select(cum_sum_reset(pl.all(), 5))
print(out)
Collaborator

I'm confused by this example - for these particular numbers, they're all above the threshold, so wouldn't the result just be the same as the input? maybe run it with cum_sum_reset(pl.all(), 30)?
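
To make the threshold question concrete, here is a plain-Python sketch of the same reset rule, with no Numba or Polars involved. The function is a hypothetical reference version written for illustration (it takes a list rather than NumPy arrays) and is not code from the PR.

```python
def cum_sum_reset(values, threshold):
    """Running sum that restarts at the current value once it reaches `threshold`."""
    result = [values[0]]  # the first element is copied without a threshold check
    for value in values[1:]:
        total = value + result[-1]
        # Same rule as the kernel in the PR: reset when the running total
        # reaches or exceeds the threshold, otherwise keep accumulating.
        result.append(value if total >= threshold else total)
    return result


data = [10, 9, 8, 7]
low = cum_sum_reset(data, 5)    # [10, 9, 8, 7]
high = cum_sum_reset(data, 30)  # [10, 19, 27, 7]
print(low, high)
```

With a threshold of 5 every running total already reaches the threshold, so the output equals the input. With 30 the sum accumulates until the last step (27 + 7 = 34), which resets.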


### Note on Performance

The speed of ufuncs comes from being vectorized, and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives polars the opportunity to run its own code before the ufunc is executed. In that way polars is still executing the ufunc with `map_batches`.
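
For readers unfamiliar with the hook mentioned here: NumPy's `__array_ufunc__` protocol lets an object intercept a ufunc call made on it. The sketch below uses a hypothetical `LoggedColumn` class (not a Polars type) to show the dispatch. It only illustrates the NumPy side of the mechanism and makes no claim about how Polars implements its own hook.

```python
import numpy as np


class LoggedColumn:
    """Hypothetical stand-in for an expression-like object; not a Polars class."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=np.float64)
        self.calls = []  # (ufunc name, method) pairs routed through the hook

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this instead of running the ufunc directly, so the
        # object can run its own code first (here: record the call).
        self.calls.append((ufunc.__name__, method))
        # Unwrap LoggedColumn inputs to plain arrays, then delegate to NumPy.
        raw = [x.values if isinstance(x, LoggedColumn) else x for x in inputs]
        return getattr(ufunc, method)(*raw, **kwargs)


col = LoggedColumn([0.0, 1.0, 2.0])
direct = np.expm1(col.values)  # plain ndarray: the hook is not involved
hooked = np.expm1(col)         # dispatched through __array_ufunc__
print(col.calls)
```

Both calls compute the same values; the only difference is that the second one passes through the object's own code before the ufunc runs.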
Collaborator

true, but map_batches has is_elementwise=False by default, and so will do the expected thing in group-by / over

maybe, rather than avoiding map_batches, map_batches should be the primary way the ufuncs are taught?

Collaborator Author

@MarcoGorelli

It turns out that all the NumPy and SciPy ufuncs are elementwise in the sense that they aren't aggregations. Where this becomes important is if someone wants to do a sum and then apply np.expm1. The `__array_ufunc__` hook will set is_elementwise=True. If it's a numba ufunc then it won't.

Whether people should be taught to not use the hook is more philosophical imo. I suppose there's slightly less parsing in that case so would be technically more performant but by using it, it's a nice syntax, imo.
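
The elementwise distinction discussed in this thread can be sketched in plain Python. The snippet below is an illustration only: the grouped data and helper names are hypothetical, and no Polars code is involved. It shows that an elementwise function gives the same result whether it is applied to the whole column or to each group separately, while a cumulative sum does not.

```python
import math
from itertools import accumulate

# Hypothetical column split into two groups, in row order.
groups = {"x": [1.0, 2.0], "y": [3.0]}
whole = [v for vals in groups.values() for v in vals]


def per_group(func):
    """Apply `func` to each group's values separately, then concatenate."""
    return [v for vals in groups.values() for v in func(vals)]


def expm1_all(vals):
    # Elementwise: each output depends on exactly one input.
    return [math.expm1(v) for v in vals]


def cumsum_all(vals):
    # Not elementwise: each output depends on all earlier inputs.
    return list(accumulate(vals))


elementwise_same = per_group(expm1_all) == expm1_all(whole)  # True
cumsum_same = per_group(cumsum_all) == cumsum_all(whole)     # False
print(elementwise_same, cumsum_same)
```

This is why an engine can safely ignore group boundaries for a function flagged as elementwise, but has to evaluate per group for anything else.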

mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet
aggregated!
mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.**
Collaborator

@MarcoGorelli MarcoGorelli Mar 20, 2024

this isn't true anymore, is it?

In [18]: df.with_columns(result=pl.col('b').map_batches(lambda x: np.cumsum(x)).over('a'))
Out[18]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ result │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 4      │
│ 1   ┆ 5   ┆ 9      │
│ 2   ┆ 6   ┆ 6      │
└─────┴─────┴────────┘

Collaborator Author

You're right. I've lost quite a bit of steam on this.

Collaborator

@MarcoGorelli MarcoGorelli left a comment

Hey

Sorry, looking at this again, and to be honest I'm not totally sold that numba should be advertised in the users guide until the null value handling is sorted out (#14811)

Otherwise results like this

import polars as pl
import numba as nb

df = pl.DataFrame({"a": [40, 39, None, 37]}) - 30


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.with_columns(result=cum_sum_reset(pl.all(), 30))
print(out)

look a bit scary
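
As a point of comparison, here is a plain-Python sketch of what an explicitly null-aware version of the same rule could look like. The policy used (a null input produces a null output and leaves the running total untouched) is an assumption chosen for illustration. It is not what Polars or Numba do, and the function name is hypothetical.

```python
def cum_sum_reset_nullable(values, threshold):
    """Cum-sum-with-reset that passes None through and keeps the running total."""
    result = []
    running = None  # last non-null output, if any
    for value in values:
        if value is None:
            result.append(None)  # assumed policy: null in, null out
            continue
        if running is None:
            running = value  # first non-null value is copied unchanged
        else:
            total = value + running
            # Reset to the current value once the total reaches the threshold.
            running = value if total >= threshold else total
        result.append(running)
    return result


# Same values as the example above after subtracting 30.
data = [10, 9, None, 7]
out = cum_sum_reset_nullable(data, 30)  # [10, 19, None, 26]
print(out)
```

Other policies are equally defensible, for example resetting the running total at a null; the point is only that the behaviour has to be chosen deliberately rather than left to whatever value sits in the null slot.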

@itamarst
Contributor

itamarst commented Apr 4, 2024

#15194 is my attempt at dealing with this document.

@deanm0000
Collaborator Author

I'll defer

@deanm0000 deanm0000 closed this Apr 4, 2024
@deanm0000 deanm0000 deleted the ufuncs branch August 26, 2024 14:01
Labels
documentation Improvements or additions to documentation python Related to Python Polars
5 participants