docs(python): Expand NumPy section in the user guide with ufunc info #13392

Conversation
docs/user-guide/expressions/numpy.md (Outdated)
Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) for a list of all supported NumPy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of the functions available under `scipy.stats`.
so glad you've opened a PR about this 🙌
I think the lint failure is just because you have a trailing space at the end of this line
I guess I'll fix this when numba can support the latest Python version, as that failed too.
Looks like numba doesn't work on Python 3.12 yet, so my numba example won't run unless the example can be executed on a lower Python version.
Looks like it's coming soon - tbh I'd suggest just waiting for numba/numba#9197
Yeah I saw that and back-burnered it unless/until something happens.

For examples that can't run during docs building (e.g. downloading stuff from the Cloud), you can enclose them in triple quotes. See this file for an example:
Hi @deanm0000 and @MarcoGorelli |
Hey thanks for the heads up. I'm working on another ufunc change so I'll probably sit on this unless/until that gets merged, so that I don't duplicate too much work. Here's the aforementioned PR #14328
```python
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [10, 9, 8, 7]})


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.select(cum_sum_reset(pl.all(), 5))
print(out)
```
I'm confused by this example - for these particular numbers, they're all above the threshold, so wouldn't the result just be the same as the input? Maybe run it with `cum_sum_reset(pl.all(), 30)`?
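The reviewer's point can be checked with a plain-Python rewrite of the same logic (a hypothetical helper, no numba needed): with a threshold of 5, every running sum immediately resets, so the output equals the input, while 30 shows the intended accumulate-then-reset behavior.

```python
def cum_sum_reset_py(x, y):
    # Running sum that resets to the current element once it reaches y;
    # mirrors the guvectorized cum_sum_reset from the diff above.
    res = [0] * len(x)
    res[0] = x[0]
    for i in range(1, len(x)):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]
    return res


print(cum_sum_reset_py([10, 9, 8, 7], 5))   # [10, 9, 8, 7] - resets every step
print(cum_sum_reset_py([10, 9, 8, 7], 30))  # [10, 19, 27, 7] - accumulates, resets once
```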
### Note on Performance

The speed of ufuncs comes from being vectorized and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives Polars the opportunity to run its own code before the ufunc is executed. In that way, Polars is still executing the ufunc with `map_batches`.
True, but `map_batches` has `is_elementwise=False` by default, and so will do the expected thing in group-by / over.

Maybe, rather than avoiding `map_batches`, `map_batches` should be the primary way the ufuncs are taught?
It turns out that all the NumPy and SciPy ufuncs are elementwise in the sense that they aren't aggregations. Where this becomes important is if someone wants to do a sum and then the `np.expm1` function. The ufunc array hook will set `is_elementwise=True`. If it's a numba ufunc then it won't.

Whether people should be taught not to use the hook is more philosophical, imo. I suppose there's slightly less parsing in that case, so it would be technically more performant, but the hook is a nice syntax, imo.
mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet aggregated!

mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.**
this isn't true anymore, is it?
```
In [18]: df.with_columns(result=pl.col('b').map_batches(lambda x: np.cumsum(x)).over('a'))
Out[18]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ result │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 4      │
│ 1   ┆ 5   ┆ 9      │
│ 2   ┆ 6   ┆ 6      │
└─────┴─────┴────────┘
```
You're right. I've lost quite a bit of steam on this.
Hey,

Sorry, looking at this again, and to be honest I'm not totally sold that numba should be advertised in the user guide until the null value handling is sorted out (#14811).

Otherwise, results like this
```python
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [40, 39, None, 37]}) - 30


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.with_columns(result=cum_sum_reset(pl.all(), 30))
print(out)
```
look a bit scary
#15194 is my attempt at dealing with this document.

I'll defer.
scipy.special is (almost) all ufuncs, so I provided a link there.

numba creates ufuncs, so I provided a link to its page and an example thereof.

Lastly, I added a note about `map_batches` avoidance. For the longest time I thought avoiding `map_batches` was a way to improve performance, until I stumbled onto this, which shows it just calls `map_batches` anyway.