Unclear how to manually do column projection with uproot.dask (and API differences with dask-awkward) #1349

Closed
pfackeldey opened this issue Dec 12, 2024 · 9 comments
Labels
question Open-ended questions from users

Comments

@pfackeldey
Collaborator

pfackeldey commented Dec 12, 2024

I'm currently looking into adjusting the IO layer of the dask graph so that it only reads a given list of columns.

With uproot.dask this looks as follows:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
io.compute()
# ... TypeError: PlaceholderArray supports only trivial slices, not int

(I have the impression that the underlying form is not updated accordingly here, or perhaps I'm using the projection interface incorrectly?)

If I do this with parquet instead, though, it works:

import dask_awkward as dak

io = dak.from_parquet("https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.parquet")
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x15c4ce1a0>
# 0. from-parquet-150809c2f6f63708200b7f130d3a395d

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_columns(frozenset(["nJet"]))
  
# now compute, should only load `nJet`
io.compute()
# <Array [{nJet: 5}, {nJet: 8}, ..., {...}, {nJet: 2}] type='40 * {nJet: uint32}'>

I don't understand why the above code example works for dak.from_parquet but not for uproot.dask; there seems to be a real difference in how the column projection is implemented in the io_func of the dask layer.

Apart from that, the APIs are very similar but a bit misaligned between uproot and dask-awkward (probably for historical reasons), e.g. project_keys vs. project_columns in the examples above. There are probably more differences that I haven't encountered yet.

In principle, it would be nice if uproot.dask adhered to the protocols defined in https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/io/columnar.py, which would eliminate these differences. Some of this code seems to be duplicated in uproot._dask as well.
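
For illustration, a rough sketch of the kind of shared protocol I mean (the class and method names here are made up; the actual protocol definitions live in the columnar.py module linked above and may differ):

from typing import Protocol

import awkward as ak


class ProjectableIOFunction(Protocol):
    # illustrative names only; not the real dask-awkward definitions

    @property
    def form(self) -> ak.forms.Form:
        """Full form the source produces when nothing is projected away."""
        ...

    def project_columns(self, columns: frozenset) -> "ProjectableIOFunction":
        """Return a copy of this IO function that reads only `columns`."""
        ...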

I'm currently trying to find a way to unify the APIs and to find the reason for this difference.
I'd appreciate any input on how this should work/behave and on how we can ensure that the APIs won't diverge in the future.

(If this API were unified, it would be rather easy to make dak.project_columns possible for all AwkwardInputLayer kinds.)

@pfackeldey pfackeldey added the docs and question labels and removed the docs label Dec 12, 2024
@pfackeldey
Collaborator Author

(pinging @lgray @agoose77 @kkothari2001 for ideas, input and help 🙏 )

@martindurant

There are different conventions for how the columns are named, and uproot encodes extra things in these names (because some columns are always required even when not in the output).

The correct place to call project is probably the layer, not the IO function; the layer provides a place to translate the "awkward" column names to the "IO convention" names.

(This is something that the one-pass PR explicitly worked around, removing a lot of protocol classes and code in the process)

@pfackeldey
Collaborator Author

pfackeldey commented Dec 13, 2024

Thanks for the clarification.
In current dask-awkward one would need to pass a report and a state to layer.project(), which is far from user-friendly, especially since state differs between uproot.dask and dak.from_*.
So it's a really nice property of the one-pass optimization that it can directly accept the column names inferred by dak.necessary_columns!
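
For example (a quick sketch; the exact keys and container types returned by dak.necessary_columns may vary between versions):

import uproot
import dask_awkward as dak

events = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})

# ask dask-awkward which input columns a computation actually needs
print(dak.necessary_columns(events.nJet * 2))
# roughly: {'from-uproot-...': frozenset({'nJet'})}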

@agoose77
Collaborator

agoose77 commented Dec 13, 2024

@pfackeldey not much time to answer in full here, but the different column-projection conventions were intentional -- they reflect the different concepts of "column" between uproot, parquet, and form remapping!

I would suggest not trying to remove that separation; that is a problem with the one-pass PR, which tried to do so.

Ultimately, column optimisation is really "buffer optimisation", and it is a black box for each array source.

Will try to get to this.

@pfackeldey
Collaborator Author

Thank you @agoose77 for your reply!
Ok, I understand that the concept of a "column" is different for uproot, parquet, etc., which is a good reason for the API differences.

I'd argue though that:

  1. my example written in this issue should not fail for uproot
  2. there should be a way to do the column projection with a plain list of strings (a list of columns) for any kind of projectable IO layer; any format-specific complexity can be hidden inside the io_func/io_layer (see the sketch below)
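
For (2), a rough sketch of the kind of helper I have in mind (purely illustrative: it assumes every io_func exposes a uniform project_columns method, which is exactly what doesn't exist today):

import dask_awkward as dak

def project_manually(array: dak.Array, columns: set) -> dak.Array:
    # hypothetical helper: each io_func hides its own mapping from field
    # names to source-specific columns/keys behind one method name
    for layer in array.dask.layers.values():
        io_func = getattr(layer, "io_func", None)
        if io_func is not None:
            layer.io_func = io_func.project_columns(frozenset(columns))
    return array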

@agoose77
Collaborator

agoose77 commented Dec 13, 2024

Apologies for terse replies: I'm in a meeting!

(1) -- on the face of it, the Parquet example surprises me -- it's actually changing the type. It should fail for uproot because the repr is trying to view missing arrays. N.B., we talked about making placeholder reprs not throw errors.
(2) -- I think there's a naming problem -- the superset of these optimisations is "buffer optimisation", so that should be the core API. If we want to support some notion of "column optimisation", it would need to sit on top of that.

@pfackeldey
Collaborator Author

pfackeldey commented Jan 14, 2025

I realized the repr failure has been fixed in scikit-hep/awkward#3342.

Rerunning the uproot.dask example of this issue yields:

import uproot

io = uproot.dask({"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root": "Events"})
print(io.dask)
# HighLevelGraph with 1 layers.
# <dask.highlevelgraph.HighLevelGraph object at 0x1670b73d0>
#  0. from-uproot-138b384738005b2a7a7eefbb600ca6c2

# update the IO layer in-place, i.e. do column projection
for lay in io.dask.layers.values():
  lay.io_func = lay.io_func.project_keys(frozenset(["nJet"]))

# now compute, should only load `nJet`
array = io.compute()
print(array)
# <Array [{run: ??, ...}, ..., {run: ??, ...}] type='40 * {run: uint32, lumin...'>

print(array.nJet)
# <Array [5, 8, 5, 3, 5, 8, 4, 4, ..., 4, 9, 3, 2, 3, 1, 6, 2] type='40 * uint32'>

So this works nicely now: all fields are placeholder arrays except for the one that I asked to be loaded. Great!

That means dak.from_parquet and uproot.dask now behave the same way in terms of data loading. The difference is in the resulting array: in the parquet case the form is pruned down to only the IO-source column that was asked for, whereas in the uproot case the form is not pruned and every "leaf" in the layout that does not correspond to "nJet" is a PlaceholderArray.
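
Concretely, using the two computed arrays from the examples in this thread (I call them parquet_array and uproot_array here just for illustration):

print(parquet_array.fields)
# ['nJet']        <- the form was pruned down to the requested column
print(uproot_array.fields)
# ['run', ...]    <- all original fields are still present; the unread
#                    leaves are PlaceholderArrays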

This now opens up the possibility of adding a function that accepts a set of IO-source columns and does the column projection "manually", which is what I was looking for in the first place.

Apart from that, the difference in the resulting form between dak.from_parquet and uproot.dask (pruned vs. placeholder arrays) is not yet fully clear to me. Could someone (maybe @agoose77, @martindurant or @jpivarski) elaborate a bit more on why this pruning is necessary for parquet?

(This issue can be closed now; I'll leave it open in case someone would like to comment on the difference mentioned above. Otherwise, feel free to close it!)

@agoose77
Collaborator

@pfackeldey the need for unproject_layout is that the interface exposed by the arrow libraries only permits us to drop columns (and not read them), so we end up with a new form from arrow. We implemented unproject_layout before placeholders existed, IIRC, but ultimately it's just figuring out how to coerce the read data into a given (expected) form.

Meanwhile, for uproot our reading is "stable" such that partial reads don't change the underlying form.

I think a point of confusion here is that project_columns is not the same thing as the column optimisation — it naively reads only the required columns and doesn't do any unprojection, IIRC.

@ianna
Collaborator

ianna commented Jan 14, 2025

@pfackeldey - I wonder if it's better to convert/move the issue to a discussion?

@scikit-hep scikit-hep locked and limited conversation to collaborators Jan 14, 2025
@ianna ianna converted this issue into discussion #1361 Jan 14, 2025
