Unclear how to manually do column projection with uproot.dask (and API differences with dask-awkward) #1349

Closed
pfackeldey opened this issue Dec 12, 2024 · 9 comments
Labels
question Open-ended questions from users

Comments

@pfackeldey
Collaborator

pfackeldey commented Dec 12, 2024

I'm currently looking into adjusting the IO layer of the dask graph so that it only reads a given list of columns.

With uproot.dask this looks as follows:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int

(I have the impression that the underlying form is not updated accordingly here, or perhaps I'm using the projection interface incorrectly?)

If I do this with parquet instead, though, it works:

import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))
  
# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>

I don't understand why the above code example works for dak.from_parquet but not for uproot.dask; there seems to be a real difference in how the column projection is implemented in the io_func of the dask layer.

Apart from that, the APIs are very similar but a bit misaligned between uproot and dask-awkward (probably for historical reasons), e.g. project_keys vs. project_columns in the examples above. There are probably more differences that I haven't encountered yet.

In principle, it would be nice if uproot.dask adhered to the protocols defined in https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, which would eliminate these differences. Some of this code seems to be duplicated in uproot._dask as well.
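
For illustration, a rough sketch of the kind of shared protocol I mean (the class and method names here are made up; the actual protocol definitions live in the columnar.py module linked above and may differ):

from typing import Protocol

import awkward as ak


class ProjectableIOFunction(Protocol):
    # illustrative names only; not the real dask-awkward definitions

    @property
    def form(self) -> ak.forms.Form:
        """Full form the source produces when nothing is projected away."""
        ...

    def project_columns(self, columns: frozenset) -> "ProjectableIOFunction":
        """Return a copy of this IO function that reads only `columns`."""
        ...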

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and on how we can ensure that the APIs won't diverge in the future.

(If this API were unified, it would be rather easy to make dak.project_columns possible for all AwkwardInputLayer kinds.)

@pfackeldey pfackeldey added the docs and question labels and removed the docs label Dec 12, 2024
@pfackeldey
Collaborator Author

(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏 )

@martindurant

There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output).

The correct place to call project is probably the layer, not the IO function; the layer provides a place to translate the "awkward" column names to the "IO convention" names.

(This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process)

@pfackeldey
Collaborator Author

pfackeldey commented Dec 13, 2024

Thanks for the clarification.
In current dask-awkward one would need to pass a report and a state to layer.project(), which is far from user-friendly, especially since state differs between uproot.dask and dak.from_*.
So it's a really nice property of the one-pass optimization that it can directly accept the column names inferred by dak.necessary_columns!
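
For example (a quick sketch; the exact keys and container types returned by dak.necessary_columns may vary between versions):

import uproot
import dask_awkward as dak

events = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})

# ask dask-awkward which input columns a computation actually needs
print(dak.necessary_columns(events.nJet * 2))
# roughly: {'from-uproot-...': frozenset({'nJet'})}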

@agoose77
Collaborator

agoose77 commented Dec 13, 2024

@pfackeldey not much time to answer in full here, but the different column-projection conventions were intentional -- they reflect the different concepts of "column" between uproot, parquet, and form remapping!

I would suggest not trying to remove that separation; that is a problem with the one-pass PR, which tried to do so.

Ultimately, column optimisation is really "buffer optimisation", and it is a black box for each array source.

Will try to get to this.

@pfackeldey
Collaborator Author

Thank you @agoose77 for your reply!
Ok, I understand that the concept of a "column" is different for uproot, parquet, etc., which is a good reason for the API differences.

I'd argue though that:

  1. my example written in this issue should not fail for uproot
  2. there should be a way to do the column projection with a plain list of strings (a list of columns) for any kind of projectable IO layer; any format-specific complexity can be hidden inside the io_func/io_layer (see the sketch below)
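
For (2), a rough sketch of the kind of helper I have in mind (purely illustrative: it assumes every io_func exposes a uniform project_columns method, which is exactly what doesn't exist today):

import dask_awkward as dak

def project_manually(array: dak.Array, columns: set) -> dak.Array:
    # hypothetical helper: each io_func hides its own mapping from field
    # names to source-specific columns/keys behind one method name
    for layer in array.dask.layers.values():
        io_func = getattr(layer, "io_func", None)
        if io_func is not None:
            layer.io_func = io_func.project_columns(frozenset(columns))
    return array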

@agoose77
Collaborator

agoose77 commented Dec 13, 2024

Apologies for terse replies: I'm in a meeting!

(1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type. It should fail for uproot because the repr is trying to view missing arrays. N.B., we talked about making placeholder reprs not throw errors.
(2) -- I think there's a naming problem -- the superset of these optimisations is "buffer optimisation", so that should be the core API. If we want to support some notion of "column optimisation", it would need to sit on top of that.

@pfackeldey
Collaborator Author

pfackeldey commented Jan 14, 2025

I realized the repr failure has been fixed in scikit-hep/awkward#3342.

Rerunning the uproot.dask example of this issue yields:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
array = io.compute()
print(array)
# <Array [{run: ??, ...}, ..., {run: ??, ...}] type='40 * {run: uint32, lumin...'>

print(array.nJet)
# <Array [5, 8, 5, 3, 5, 8, 4, 4, ..., 4, 9, 3, 2, 3, 1, 6, 2] type='40 * uint32'>

So this works nicely now: all fields are placeholder arrays except for the one that I asked to be loaded. Great!

That means dak.from_parquet and uproot.dask now behave the same way in terms of data loading. The difference is in the resulting array: in the parquet case the form is pruned down to only the IO-source column that was asked for, whereas in the uproot case the form is not pruned and every "leaf" in the layout that does not correspond to "nJet" is a PlaceholderArray.
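
Concretely, using the two computed arrays from the examples in this thread (I call them parquet_array and uproot_array here just for illustration):

print(parquet_array.fields)
# ['nJet']        <- the form was pruned down to the requested column
print(uproot_array.fields)
# ['run', ...]    <- all original fields are still present; the unread
#                    leaves are PlaceholderArrays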

This now opens up the possibility of adding a function that accepts a set of IO-source columns and does the column projection "manually", which is what I was looking for in the first place.

Apart from that, the difference in the resulting form between dak.from_parquet and uproot.dask (pruned vs. placeholder arrays) is not yet fully clear to me. Could someone (maybe @agoose77, @martindurant or @jpivarski) elaborate a bit more on why this pruning is necessary for parquet?

(This issue can be closed now; I'll leave it open in case someone would like to comment on the difference mentioned above. Otherwise, feel free to close it!)

@agoose77
Collaborator

@pfackeldey the need for unproject_layout is that the interface exposed by the arrow libraries only permits us to drop columns (and not read them), so we end up with a new form from arrow. We implemented unproject_layout before placeholders existed, IIRC, but ultimately it's just figuring out how to coerce the read data into a given (expected) form.

Meanwhile, for uproot our reading is "stable" such that partial reads don't change the underlying form.

I think a point of confusion here is that project_columns is not the same thing as the column optimisation — it naively reads only the required columns and doesn't do any unprojection, IIRC.

@ianna
Collaborator

ianna commented Jan 14, 2025

@pfackeldey - I wonder if it's better to convert/move the issue to a discussion?

@scikit-hep scikit-hep locked and limited conversation to collaborators Jan 14, 2025
@ianna ianna converted this issue into discussion #1361 Jan 14, 2025
