Multiple operations instead of means #33

Open
bnlawrence opened this issue Oct 25, 2022 · 8 comments

@bnlawrence
Collaborator

bnlawrence commented Oct 25, 2022

In working through the implications of implementing means in chunks, it is notable that once missing data is in play, we need to return two numbers from the reduce_chunk method: the sum and the count, because a mean over chunks needs to be weighted by the actual number of values contributing from each chunk.
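As a rough illustration of why the count matters, here is a minimal sketch. The function names, the `-999.0` missing-value sentinel, and the chunk contents are all assumptions made for the example; this is not the project's actual reduce_chunk implementation.

```python
import numpy as np

MISSING = -999.0  # assumed sentinel for missing data (illustrative only)


def reduce_chunk_sum(chunk, missing_value):
    """Hypothetical stand-in for reduce_chunk: return (sum, count),
    counting only the values that are not missing."""
    valid = chunk[chunk != missing_value]
    return valid.sum(), valid.size


def combine_mean(partials):
    """Combine per-chunk (sum, count) pairs into a single mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else float("nan")


# Two chunks of equal size, but with different numbers of valid values.
chunks = [
    np.array([1.0, 2.0, MISSING, MISSING]),
    np.array([3.0, 4.0, 5.0, 6.0]),
]
partials = [reduce_chunk_sum(c, MISSING) for c in chunks]

# Weighting by the actual counts gives the true mean of the valid values.
weighted_mean = combine_mean(partials)

# Averaging the per-chunk means ignores the counts and gives a different answer.
naive_mean = np.mean([s / n for s, n in partials])
```

With these inputs the count-weighted mean is 3.5, while the mean of the per-chunk means is 3.0, which is why the sum alone (or the chunk mean alone) is not enough.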

There are a number of ways we could implement this:

  1. Always return (X, N), where X is the expected operation, and N the number of values contributing
  2. Only return (X, N) when required (e.g. for means) otherwise return (X,None) or (X,)
  3. Return X, except when it needs to be (X,N)
  4. Something else.

The something else option could be slightly more interesting: do we think it's a smart idea to say we could chain a series of methods and expect a series of results, in a lightweight sort of caching?

Obvious use cases would be:

  • mean = sum, count
  • range = min, max
  • sqmean = sum(squares), sum, count

This could be facilitated by handing over not just "a method" but a list of one or more methods, and expecting back a list of one or more results.
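One possible shape for this, sketched under assumptions: the `OPERATIONS` registry, the `reduce_chunk_multi` name, and the missing-value handling below are hypothetical and only illustrate the "list of methods in, list of results out" idea.

```python
import numpy as np

# Hypothetical registry mapping method names to in-memory operations.
OPERATIONS = {
    "sum": np.sum,
    "count": np.size,
    "min": np.min,
    "max": np.max,
    "sum_of_squares": lambda values: np.sum(values * values),
}


def reduce_chunk_multi(chunk, methods, missing_value=None):
    """Apply each named method to the same in-memory chunk and
    return the results in the same order as the methods."""
    valid = chunk if missing_value is None else chunk[chunk != missing_value]
    return [OPERATIONS[name](valid) for name in methods]


chunk = np.array([2.0, 4.0, -999.0, 6.0])

# mean = sum, count
total, count = reduce_chunk_multi(chunk, ["sum", "count"], missing_value=-999.0)

# range = min, max
lo, hi = reduce_chunk_multi(chunk, ["min", "max"], missing_value=-999.0)
```

The chunk is filtered once and every requested method reuses the same array, which is the lightweight caching effect described above.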

@bnlawrence bnlawrence added the excalibur Needs discussion by the excalibur team label Oct 25, 2022
@bnlawrence
Collaborator Author

@mkjpryor, I'm pretty sure you saw this coming ...

@markgoddard

I thought about this recently, but the thought disappeared before I did anything with it. For me, Option 1 seems like the most sensible of the easy options. Option 4 doesn't seem like it would be too taxing to implement.

@davidhassell
Collaborator

I would want N to always be available, so I would favour Option 1.

My use case for this would be that cf-python uses N for more than just the mean (and other statistical moments): it gives the user the opportunity to mask out results that have too small an input sample, e.g. you can say "give me the time maximum of this T-Y-X data, but only where at least half the times are not missing data". For that you need N to propagate through the client side (dask) reduction. Whilst cf-python could probably work out N if it were not provided, as the size of a workflow chunk, if there's no penalty in always returning N, it seems cleaner not to introduce complexity into two or more libraries if we don't have to.
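A small sketch of that use case, assuming each workflow chunk returns (max, N) per grid point. The helper name, the sentinel value, and the tiny arrays are invented for illustration; this is plain NumPy, not cf-python or dask code.

```python
import numpy as np

MISSING = -999.0  # assumed missing-value sentinel (illustrative only)


def chunk_max_and_count(chunk, axis=0):
    """Hypothetical per-chunk reduction: return (max, N) along the time axis,
    where N counts the non-missing values at each point."""
    valid = chunk != MISSING
    n = valid.sum(axis=axis)
    # Missing values are replaced by -inf so they never win the maximum.
    x = np.where(valid, chunk, -np.inf).max(axis=axis)
    return x, n


# Two chunks along time, each with 2 times and 3 spatial points.
chunk_a = np.array([[1.0, MISSING, 5.0],
                    [2.0, MISSING, MISSING]])
chunk_b = np.array([[3.0, MISSING, MISSING],
                    [4.0, 7.0, MISSING]])

partials = [chunk_max_and_count(c) for c in (chunk_a, chunk_b)]

# Client-side combination: maximum of the maxima, sum of the counts.
combined_max = np.maximum.reduce([x for x, _ in partials])
combined_n = np.sum([n for _, n in partials], axis=0)

# Keep a result only where at least half the times were not missing.
n_times = chunk_a.shape[0] + chunk_b.shape[0]
result = np.ma.masked_where(combined_n < n_times / 2, combined_max)
```

Here only the first point has enough valid times (4 of 4), so the other two points are masked even though a maximum could be computed for them.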

> The something else option could be slightly more interesting: do we think it's a smart idea to say we could chain a series of methods and expect a series of results, in a lightweight sort of caching?

This is interesting. I'm thinking you mean something like a chaining framework on the storage that always did 1) read the data from disk and then 2) get the results of the following list of operations on the in-memory data. E.g. the client asks for "range" and gets back from Active something like `{'min': -45, 'max': 65, 'N': 1200}` for each workflow chunk. These numbers would have been calculated on the storage by one read of each storage chunk followed by the three in-memory operations (max, min, count) that the storage knows are required for a range.
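To make that reading concrete, here is a hedged sketch in which a named request such as "range" expands to the primitive operations it needs, all applied after a single read. The `COMPOUND` and `PRIMITIVES` tables, `active_reduce`, and `read_chunk` are hypothetical names, not part of any existing Active Storage API.

```python
import numpy as np

# Hypothetical mapping from a requested operation to the primitives it needs.
COMPOUND = {
    "range": ("min", "max", "N"),
    "mean": ("sum", "N"),
}

PRIMITIVES = {
    "min": np.min,
    "max": np.max,
    "sum": np.sum,
    "N": np.size,
}


def active_reduce(read_chunk, operation):
    """Read the chunk once, then compute every primitive that the
    requested operation needs on the in-memory data."""
    data = read_chunk()  # the only read of this storage chunk
    return {name: PRIMITIVES[name](data) for name in COMPOUND[operation]}


reads = []  # records how many times the chunk is read


def read_chunk():
    """Stand-in for reading one storage chunk from disk."""
    reads.append(1)
    return np.array([-45.0, 10.0, 65.0])


result = active_reduce(read_chunk, "range")
```

For this toy chunk the result has the same shape as the dictionary described above, with `min`, `max` and `N` keys, and the chunk is read exactly once.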

@bnlawrence
Collaborator Author

bnlawrence commented Oct 25, 2022

Thanks Mark, David. OK, I'm implementing Option 1 for now ... clearly Option 4 is an extension, and we can consider that in the (possibly near) future ... and yes David, the interpretation you have above is what I had in mind.

@valeriupredoi
Collaborator

I would argue that the generic operation of masking, i.e. returning reduced data based on a condition, should be an integral part of AS and be performed by the storage unit or its software. Masking reduces data by a lot, it is used very often in the field, and it is relatively costly because data needs to be looked at chunk by chunk. I don't know if this was in the initial design, but I reckon it has to be, the sooner the better. For now, I too agree that returning (X, N) is the way to go, but I see this only as a toy/interim solution 🍺

@markgoddard

@valeriupredoi how would you define the condition? Comparison with a null value, or something more complex?

@bnlawrence bnlawrence self-assigned this Oct 27, 2022
@bnlawrence
Collaborator Author

We had a bit of a discussion about masking today. The bottom line is that doing anything beyond what we have done for missing data would likely have minimal impact in the use cases we discussed - since the real benefit of all this work will come with high volume data, which will likely be compressed, in which case masking is already efficiently hidden. Unmasking before returning a chunk from storage is not likely to be beneficial in this situation.

@bnlawrence
Collaborator Author

The option for carrying out a series of operations would, as @markgoddard suggests, require relatively little work in client and servers, but we won't consider it until we have everything else working.

@bnlawrence bnlawrence added this to the Post-Prototype milestone Oct 27, 2022

4 participants