Multiple operations instead of means #33

bnlawrence · 2022-10-25T06:13:16Z

In working through the implications of implementing means in chunks, it is notable that once missing data is in play, we need to return two numbers from the reduce_chunk method: the sum and the count, because means over chunks will need to be weighted by the actual number of values being averaged.
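To illustrate why both numbers are needed, here is a minimal sketch in plain Python. The `reduce_chunk` signature, the `missing` sentinel of -999.0 and the sample data are assumptions made for the example, not the actual PyActiveStorage interface.

```python
MISSING = -999.0  # hypothetical missing-data sentinel, for illustration only


def reduce_chunk(chunk, missing=MISSING):
    """Return (sum, count) over the non-missing values of one chunk."""
    valid = [v for v in chunk if v != missing]
    return sum(valid), len(valid)


# Two chunks of equal size, but with different numbers of valid values.
chunks = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, MISSING, MISSING, MISSING],
]

partials = [reduce_chunk(c) for c in chunks]

# Correct: combine the sums and counts, then divide once.
total = sum(s for s, _ in partials)
n = sum(c for _, c in partials)
weighted_mean = total / n

# Incorrect once data is missing: an unweighted mean of per-chunk means.
naive_mean = sum(s / c for s, c in partials) / len(partials)
```

With these values the combined mean is 20.0 / 5 = 4.0, whereas the unweighted mean of the two chunk means (2.5 and 10.0) is 6.25.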

There are a number of ways we could implement this:

1. Always return (X, N), where X is the expected operation, and N the number of values contributing
2. Only return (X, N) when required (e.g. for means), otherwise return (X, None) or (X,)
3. Return X, except when it needs to be (X, N)
4. Something else.

The something else option could be slightly more interesting: do we think it's a smart idea to say we could chain a series of methods and expect a series of results, in a lightweight sort of caching?

Obvious use cases would be:

mean = sum, count
range = min, max
sqmean = sum(squares), sum, count

This could be facilitated by handing over not just "a method" but a list of 1..many methods, and expecting back a list of 1..many results.
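A rough sketch of what that could look like, assuming a hypothetical `reduce_chunk(chunk, methods)` signature and a -999.0 missing-data sentinel (neither is taken from the existing code):

```python
def count(values):
    """Number of contributing values."""
    return len(values)


def sum_of_squares(values):
    return sum(v * v for v in values)


def reduce_chunk(chunk, methods, missing=-999.0):
    """Apply each method to the valid values of one chunk.

    The chunk is filtered once and every method reuses the result,
    which is the lightweight caching idea. Note that min and max
    would raise on a chunk with no valid values.
    """
    valid = [v for v in chunk if v != missing]
    return [method(valid) for method in methods]


chunk = [3.0, -999.0, 1.0, 2.0]

mean_parts = reduce_chunk(chunk, [sum, count])                    # mean
range_parts = reduce_chunk(chunk, [min, max])                     # range
sqmean_parts = reduce_chunk(chunk, [sum_of_squares, sum, count])  # sqmean
```

The results come back in the same order as the methods, so the caller can unpack them positionally.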


bnlawrence · 2022-10-25T06:23:04Z

@mkjpryor, I'm pretty sure you saw this coming ...

markgoddard · 2022-10-25T08:36:49Z

I thought about this recently, but the thought disappeared before I did anything with it. For me, Option 1 seems like the most sensible of the easy options. Option 4 doesn't seem like it would be too taxing to implement.

davidhassell · 2022-10-25T09:22:43Z

I would want N to always be available, so I would favour Option 1.

My use case for this would be that cf-python uses N for more than just the mean (and other statistical moments): it gives the user the opportunity to mask out results that have too small an input sample, e.g. you can say "give me the time maximum of this T-Y-X data, but only where at least half the times are not missing data". For that you need N to propagate through the client side (dask) reduction. Whilst cf-python could probably work out N if it were not provided, as the size of a workflow chunk, if there's no penalty in always returning N, it seems cleaner to not introduce complexity into >=2 libraries if we don't have to.
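As a sketch of that use case for a single grid point, in plain Python rather than cf-python or dask: the function names, the -999.0 sentinel and the `min_fraction` parameter below are all made up for illustration.

```python
def reduce_chunk_max(chunk, missing=-999.0):
    """Return (max, N) for one chunk; max is None if nothing is valid."""
    valid = [v for v in chunk if v != missing]
    return (max(valid) if valid else None), len(valid)


def combine_max(partials, total_size, min_fraction=0.5):
    """Combine per-chunk (max, N) pairs into one (max, N) result.

    The result is masked (None) when fewer than min_fraction of the
    total_size input values were valid.
    """
    n = sum(count for _, count in partials)
    if n == 0 or n < min_fraction * total_size:
        return None, n
    return max(x for x, count in partials if count > 0), n


# Time series for two grid points, each split into two chunks of two times.
point_a = [[5.0, -999.0], [7.0, 6.0]]      # 3 of 4 times valid
point_b = [[-999.0, -999.0], [9.0, -999.0]]  # 1 of 4 times valid

result_a = combine_max([reduce_chunk_max(c) for c in point_a], total_size=4)
result_b = combine_max([reduce_chunk_max(c) for c in point_b], total_size=4)
```

Here the first point keeps its maximum (7.0, with N = 3) and the second is masked because only one of its four times is valid. Without N coming back from each chunk, the combining step could not make that decision.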

> The something else option could be slightly more interesting: do we think it's a smart idea to say we could chain a series of methods and expect a series of results, in a lightweight sort of caching?

This is interesting. I'm thinking you mean something like a chaining framework on the storage that always did 1) read the data from disk and then 2) get the results of the following list of operations on the in-memory data. E.g. the client asks for "range" and gets back from Active something like `{'min': -45, 'max': 65, 'N': 1200}` for each workflow chunk. These numbers would have been calculated on the storage by one read of each storage chunk followed by the three in-memory operations (max, min, count) that the storage knows are required for a range.
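Something along these lines, perhaps. This is only a sketch of the idea: `PRIMITIVES`, `RECIPES`, `active_reduce` and `read_chunk` are hypothetical names, not anything that exists in Active today.

```python
# In-memory operations the storage knows how to perform.
PRIMITIVES = {"min": min, "max": max, "sum": sum, "N": len}

# Which primitives each client-facing operation needs.
RECIPES = {
    "range": ("min", "max", "N"),
    "mean": ("sum", "N"),
}


def active_reduce(read_chunk, operation):
    """Read the chunk once, then run every primitive the operation needs."""
    values = read_chunk()  # the only read of this storage chunk
    return {name: PRIMITIVES[name](values) for name in RECIPES[operation]}


reads = []  # records each read so the single-read behaviour can be checked


def read_chunk():
    """Stand-in for reading one storage chunk from disk."""
    reads.append(1)
    return [-45.0, 12.5, 65.0]


result = active_reduce(read_chunk, "range")
```

For this toy chunk, `result` holds the minimum (-45.0), maximum (65.0) and count (3), and `reads` shows the chunk was read exactly once.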

bnlawrence · 2022-10-25T09:34:32Z

Thanks Mark, David. OK, I'm implementing Option 1 for now ... clearly Option 4 is an extension, and we can consider that in the (possibly near) future ... and yes David, the interpretation you have above is what I had in mind.

valeriupredoi · 2022-10-25T10:12:40Z

I would argue that the generic operation of masking, i.e. returning reduced data based on a condition, should be an integral part of AS and be performed by the storage unit/its software. Masking reduces data by a lot, it is used very often in the field, and it is relatively costly because data needs to be looked at chunk by chunk. I don't know if this was in the initial design, but I reckon it has to be, the sooner the better. For now, I too agree that returning (X, N) is the way to go, but I see this only as a toy/interim solution 🍺

markgoddard · 2022-10-25T12:19:56Z

@valeriupredoi how would you define the condition? Comparison with a null value, or something more complex?

bnlawrence · 2022-10-27T09:39:43Z

We had a bit of a discussion about masking today. The bottom line is that doing anything beyond what we have done for missing data would likely have minimal impact in the use cases we discussed - since the real benefit of all this work will come with high volume data, which will likely be compressed, in which case masking is already efficiently hidden. Unmasking before returning a chunk from storage is not likely to be beneficial in this situation.

bnlawrence · 2022-10-27T09:40:45Z

The option for carrying out a series of operations would, as @markgoddard suggests, require relatively little work in client and servers, but we won't consider it until we have everything else working.

bnlawrence added the excalibur label (Needs discussion by the excalibur team) Oct 25, 2022

bnlawrence mentioned this issue Oct 25, 2022

Introduce the controlled vocabulary for operations into PyActiveStorage #34

Open

bnlawrence self-assigned this Oct 27, 2022

bnlawrence added this to the Post-Prototype milestone Oct 27, 2022
