Make it easier to run aggregations over nested elements in nf.eval, nf.query and nf.nested.nest #155

Open
2 of 3 tasks
hombit opened this issue Oct 16, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

Collaborator

hombit commented Oct 16, 2024

Feature request

Today, we have these ways to aggregate the values of a single nested column:

  • nf.reduce(np.mean, "lc.mag") - good, but not cheap, and it requires joining the output back to the frame
  • nf.eval("lc.mag.groupby(by=lc.mag.index).mean()") - expensive and not intuitive

It would be nice if we could develop an easier way to do such aggregations. Options I see:

  1. Currently, we can do nf.eval("lc.mag.mean()") / nf["lc.mag"].mean(), but it outputs the aggregation over all the flat values, which is not intuitive, especially in the first case. We could redefine this behavior.
  2. Add a special interface for nested aggregations with a .nest accessor, e.g. nf.lc.nest.mean() would return nf.shape[0] mean values.
  3. Add special methods that would work in the eval/query environment only, e.g. nf.eval("lc.mag.nest_mean()")
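To illustrate the difference behind option 1, here is a minimal sketch using plain pandas only. The Series below is a hypothetical stand-in for a flat nested column (its index repeats the outer row label); it is not the actual nested-pandas object, and the names are made up for illustration.

```python
import pandas as pd

# Hypothetical flat view of a nested column: three outer rows holding
# 2, 1, and 3 nested values; the index repeats the outer row label.
flat_mag = pd.Series(
    [1.0, 3.0, 10.0, 2.0, 4.0, 6.0],
    index=[0, 0, 1, 2, 2, 2],
    name="mag",
)

# Aggregating the flat values collapses everything into a single number.
overall_mean = flat_mag.mean()

# Grouping by the repeated index gives one value per outer row instead.
per_row_mean = flat_mag.groupby(level=0).mean()

print(overall_mean)
print(per_row_mean.tolist())
```

The second result has one entry per outer row, which is the shape options 2 and 3 would aim to return directly.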

However, I'm not sure how we'd make all of these performant: it looks like pyarrow provides almost zero tooling for that. Maybe we can use things like numpy.ufunc.reduceat and scipy.ndimage.mean.
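A minimal sketch of the numpy.ufunc.reduceat idea, assuming the nested values are available as one flat array plus list-style offsets (segment i is values[offsets[i]:offsets[i + 1]]). The helper name segment_mean is hypothetical, and the sketch assumes every segment is non-empty.

```python
import numpy as np


def segment_mean(values, offsets):
    """Per-segment mean; segment i is values[offsets[i]:offsets[i + 1]].

    Assumes every segment is non-empty and offsets[-1] == len(values).
    """
    values = np.asarray(values, dtype=float)
    offsets = np.asarray(offsets)
    # Sum each segment in one vectorized pass over the flat values.
    sums = np.add.reduceat(values, offsets[:-1])
    # Segment lengths are the differences between consecutive offsets.
    counts = np.diff(offsets)
    return sums / counts


values = np.array([1.0, 3.0, 10.0, 2.0, 4.0, 6.0])
offsets = np.array([0, 2, 3, 6])
means = segment_mean(values, offsets)
print(means)
```

The same pattern should work for other ufunc-based reductions, e.g. np.minimum.reduceat or np.maximum.reduceat for per-row min and max.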

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
@hombit hombit added the enhancement New feature or request label Oct 16, 2024
Collaborator Author

hombit commented Oct 18, 2024

Some motivating benchmarks

import numpy as np
import pyarrow as pa
from nested_pandas.datasets import generate_data

nf = generate_data(10_000, 1000)

# Per-row mean via reduce (timings below are from an IPython session)
%timeit nf.reduce(np.mean, 'nested.flux')
# 43.3 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Per-row mean computed from the Arrow list values and offsets
flux = pa.array(nf['nested']).field('flux')  # this is fast, ~ 5μs
%timeit np.add.reduceat(flux.values, flux.offsets[:-1]) / np.diff(flux.offsets)
# 1.92 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
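One caveat with the reduceat one-liner: for an empty segment np.add.reduceat returns the element at that offset rather than 0, and a trailing empty segment puts an offset out of bounds. Below is a hedged sketch of one possible guard, using NumPy only. The function name is hypothetical, returning NaN for empty rows is an assumption about the desired behavior, and it assumes offsets[-1] == len(values).

```python
import numpy as np


def segment_mean_with_empties(values, offsets):
    """Per-segment mean that returns NaN for empty segments.

    Assumes offsets are non-decreasing and offsets[-1] == len(values).
    """
    values = np.asarray(values, dtype=float)
    offsets = np.asarray(offsets)
    counts = np.diff(offsets)
    nonempty = counts > 0
    means = np.full(counts.shape, np.nan)
    if nonempty.any():
        # Empty segments hold no values, so the starts of the non-empty
        # segments alone still partition the flat array correctly.
        starts = offsets[:-1][nonempty]
        means[nonempty] = np.add.reduceat(values, starts) / counts[nonempty]
    return means


values = np.array([1.0, 3.0, 10.0])
offsets = np.array([0, 2, 2, 3, 3])  # rows 1 and 3 are empty
means = segment_mean_with_empties(values, offsets)
print(means)
```

The extra masking costs a few array operations, so it should stay close to the unguarded version in speed, though that is not measured here.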
