docs(python): Expand NumPy section in the user guide with ufunc info #13392

Conversation
docs/user-guide/expressions/numpy.md (Outdated)
Polars expressions support NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html). See [here](https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs) for a list of all supported NumPy functions. Additionally, SciPy offers a wide host of ufuncs. Specifically, the [scipy.special](https://docs.scipy.org/doc/scipy/reference/special.html#module-scipy.special) namespace has ufunc versions of many (possibly most) of the functions available under `scipy.stats`.
so glad you've opened a PR about this 🙌
I think the lint failure is just because you have a trailing space at the end of this line
I guess I'll fix this when numba can support the latest Python version, as that failed too.
Looks like numba doesn't work on Python 3.12 yet, so my numba example won't run unless the example can be executed on a lower Python version.
Looks like it's coming soon - tbh I'd suggest just waiting for numba/numba#9197
Yeah I saw that and back-burnered it unless/until something happens.

For examples that can't run during docs building (e.g. downloading stuff from the Cloud), you can enclose them in triple quotes. See this file for an example:
Hi @deanm0000 and @MarcoGorelli |
Hey thanks for the heads up. I'm working on another ufunc change so I'll probably sit on this unless/until that gets merged, so that I don't duplicate too much work. Here's the aforementioned PR #14328
```python
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [10, 9, 8, 7]})


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.select(cum_sum_reset(pl.all(), 5))
print(out)
```
I'm confused by this example - for these particular numbers, they're all above the threshold, so wouldn't the result just be the same as the input? Maybe run it with `cum_sum_reset(pl.all(), 30)`?
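The reviewer's point can be checked with a plain-Python rewrite of the same logic (a hypothetical helper, no numba needed): with a threshold of 5, every running sum immediately resets, so the output equals the input, while 30 shows the intended accumulate-then-reset behavior.

```python
def cum_sum_reset_py(x, y):
    # Running sum that resets to the current element once it reaches y;
    # mirrors the guvectorized cum_sum_reset from the diff above.
    res = [0] * len(x)
    res[0] = x[0]
    for i in range(1, len(x)):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]
    return res


print(cum_sum_reset_py([10, 9, 8, 7], 5))   # [10, 9, 8, 7] - resets every step
print(cum_sum_reset_py([10, 9, 8, 7], 30))  # [10, 19, 27, 7] - accumulates, resets once
```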
### Note on Performance

The speed of ufuncs comes from being vectorized and compiled. That said, there's no inherent benefit in using ufuncs just to avoid the use of `map_batches`. As mentioned above, ufuncs use a hook which gives Polars the opportunity to run its own code before the ufunc is executed. In that way, Polars is still executing the ufunc with `map_batches`.
True, but `map_batches` has `is_elementwise=False` by default, and so will do the expected thing in group-by / over.

Maybe, rather than avoiding `map_batches`, `map_batches` should be the primary way the ufuncs are taught?
It turns out that all the NumPy and SciPy ufuncs are elementwise in the sense that they aren't aggregations. Where this becomes important is if someone wants to do a sum and then the `np.expm1` function. The ufunc array hook will set `is_elementwise=True`. If it's a numba ufunc then it won't.

Whether people should be taught not to use the hook is more philosophical, imo. I suppose there's slightly less parsing in that case, so it would be technically more performant, but the hook is a nice syntax, imo.
mean that the `Series` represents a column in a `DataFrame`. Note that in the `group_by` context, that column is not yet aggregated!

mean that the `Series` represents a column in a `DataFrame`. To be clear, **using a `group_by` or `over` with `map_batches` will return results as though there was no group at all.**
this isn't true anymore, is it?
```
In [18]: df.with_columns(result=pl.col('b').map_batches(lambda x: np.cumsum(x)).over('a'))
Out[18]:
shape: (3, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ result │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ i64    │
╞═════╪═════╪════════╡
│ 1   ┆ 4   ┆ 4      │
│ 1   ┆ 5   ┆ 9      │
│ 2   ┆ 6   ┆ 6      │
└─────┴─────┴────────┘
```
You're right. I've lost quite a bit of steam on this.
Hey,

Sorry, looking at this again, and to be honest I'm not totally sold that numba should be advertised in the user guide until the null value handling is sorted out (#14811).

Otherwise, results like this
```python
import polars as pl
import numba as nb

df = pl.DataFrame({"a": [40, 39, None, 37]}) - 30


@nb.guvectorize([(nb.int64[:], nb.int64, nb.int64[:])], "(n),()->(n)")
def cum_sum_reset(x, y, res):
    res[0] = x[0]
    for i in range(1, x.shape[0]):
        res[i] = x[i] + res[i - 1]
        if res[i] >= y:
            res[i] = x[i]


out = df.with_columns(result=cum_sum_reset(pl.all(), 30))
print(out)
```
look a bit scary
#15194 is my attempt at dealing with this document.

I'll defer.
scipy.special is (almost) all ufuncs, so I provided a link there.

numba creates ufuncs, so I provided a link to its page and an example thereof.

Lastly, I added a note about `map_batches` avoidance. For the longest time I thought avoiding `map_batches` was a way to improve performance, until I stumbled onto this, which shows it just calls `map_batches` anyway.