Accounting for missing values in active storage operations. #18

Open
davidhassell opened this issue Oct 4, 2022 · 14 comments
Labels: data handling (handling data), enhancement (New feature or request), excalibur (Needs discussion by the excalibur team)


@davidhassell
Collaborator

No answers yet, just a statement of need.

Missing values need to be accounted for during active operations. For instance, a land-surface temperature minimum needs to ignore a missing_value of -1e20 over the oceans. Therefore the missing values (of which there can be 0 to many) need to be passed to the active storage, similarly to how the data type needs to be passed.

Things get complicated because there are many different ways of specifying missing values (https://docs.unidata.ucar.edu/nug/current/attribute_conventions.html), some of which are not simple numbers:

  • The value of the _FillValue
  • The value or values of the missing_value (which may be a scalar or vector)
  • Any value strictly less than the valid_min number, or the first of the valid_range numbers
  • Any value strictly greater than the valid_max number, or the second of the valid_range numbers

All of these methods are used in the wild.
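
As a minimal sketch of how these four conventions might be combined into a single mask on the client, assuming the attributes have already been read from the file metadata: the function name `missing_mask` and its keyword arguments are hypothetical and are not part of any existing API, and the rule that `valid_min`/`valid_max` take precedence over `valid_range` is an assumption made only for this illustration.

```python
import numpy as np

def missing_mask(data, fill_value=None, missing_value=None,
                 valid_min=None, valid_max=None, valid_range=None):
    """Return a boolean array that is True where `data` counts as missing."""
    data = np.asarray(data)
    mask = np.zeros(data.shape, dtype=bool)

    if fill_value is not None:
        mask |= data == fill_value

    if missing_value is not None:
        # missing_value may be a scalar or a vector of values
        mask |= np.isin(data, np.atleast_1d(missing_value))

    # Assumption: valid_range is only consulted when valid_min/valid_max are absent
    lower = valid_min
    upper = valid_max
    if valid_range is not None:
        if lower is None:
            lower = valid_range[0]
        if upper is None:
            upper = valid_range[1]

    if lower is not None:
        mask |= data < lower   # strictly less than the minimum
    if upper is not None:
        mask |= data > upper   # strictly greater than the maximum
    return mask

data = np.array([-1e20, 200.0, 300.0, 400.0, 1e20])
mask = missing_mask(data, fill_value=-1e20, missing_value=[1e20],
                    valid_range=(200.0, 350.0))
```

Note that a value exactly equal to the lower bound (200.0 here) is kept, because the range tests are strict.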

The fixed missing values are typically floats which need to match exactly with values in the data, so a string decimal representation created by the client might not convert back to the exact binary representation on the storage. Does DAP deal with this, I wonder?
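
The exactness concern can be illustrated with the Python standard library alone. This is only a sketch: how the client would actually pass values to the active storage is not settled in this thread, and the variable names are illustrative. It shows a shortened decimal string failing to round-trip, and two lossless alternatives (a hexadecimal float string and the raw bytes).

```python
import struct

# NC_FILL_DOUBLE, the default netCDF fill value for 64-bit floats.
fill = 9.9692099683868690e+36

# A shortened decimal string does not convert back to the same binary value.
short_text = f"{fill:g}"          # six significant digits only
lossy = float(short_text)

# Lossless alternative 1: a hexadecimal float string.
hex_text = fill.hex()
exact_from_hex = float.fromhex(hex_text)

# Lossless alternative 2: the raw bytes (8 bytes, little-endian float64).
raw = struct.pack("<d", fill)
exact_from_bytes = struct.unpack("<d", raw)[0]

print(lossy == fill, exact_from_hex == fill, exact_from_bytes == fill)
```

Whichever encoding is used, the storage side also needs the data type, since the same bytes mean different things as 32-bit and 64-bit floats.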

@bnlawrence
Collaborator

Oh bugger. I had forgotten about all the edge cases ...

I don't think DAP has to handle this, does it? Insofar as they have the NetCDF file itself and the NetCDF semantics available server-side ... the active storage will not.

@bnlawrence
Collaborator

bnlawrence commented Oct 4, 2022

This could be an 80/20 situation:

  • If we handle _FillValue and missing_value (scalar), do we think there are many cases where both might be present?
    • Oh bugger, yes we do, since all cases of missing_value are likely to occur in the presence of a _FillValue, so we are always dealing with a vector of possible values to treat as missing.
  • If we handle min and max, do we think there are many cases where range might be present?

In the situation where we can't handle it, we default to normal storage operations of course ...

@bnlawrence
Collaborator

We should at least force "normal" operations for now, if any of these are present in metadata.

@davidhassell
Collaborator Author

Sounds like a good way forward. CMIP6 metadata mandates that you should use both _FillValue and missing_value and that they both should have the same value. This is of course not necessarily general practice elsewhere, but for model data I would have thought it is (almost) always the case.

Looking further ahead, providing a single number to the storage is probably no harder than providing "a few" but, as you say, no need to worry about that at this moment.

@bnlawrence
Collaborator

I suggest we make a few dummy files by extending dummy data.py to explore the range of these possible missing value options, and that we introduce some code to detect them all ... and reject them for active storage processing for now. When we have that, we can start unpicking them one-by-one, starting with the CMIP6 use case.
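
A rough sketch of the "detect them all and reject" step, assuming the variable's attributes are available as a plain dictionary. The names `MISSING_ATTRIBUTES`, `missing_attributes_present` and `can_use_active_storage` are hypothetical, chosen only for this illustration.

```python
# Attribute names from the netCDF attribute conventions that can mark data as missing.
MISSING_ATTRIBUTES = (
    "_FillValue",
    "missing_value",
    "valid_min",
    "valid_max",
    "valid_range",
)

def missing_attributes_present(attrs):
    """Return the names of any missing-data attributes found in `attrs`."""
    return [name for name in MISSING_ATTRIBUTES if name in attrs]

def can_use_active_storage(attrs):
    """For now, allow active storage only when no such attribute is present."""
    return not missing_attributes_present(attrs)

attrs = {"units": "K", "_FillValue": -1e20, "missing_value": -1e20}
if not can_use_active_storage(attrs):
    print("Falling back to normal storage:", missing_attributes_present(attrs))
```

Cases could then be unpicked one by one by removing names from the rejection list as support is added.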

@bnlawrence bnlawrence added the excalibur Needs discussion by the excalibur team label Oct 4, 2022
@valeriupredoi
Collaborator

valeriupredoi commented Oct 4, 2022

Very good question, David! I think the missing data value (whether it be _FillValue or missing_value) should be extracted from the file's metadata as Bryan says, so we should check for either. If they are both present but the actual float value differs then we choose 1.e+20 😁

@davidhassell
Collaborator Author

> if they are both present but the actual float value differs then we choose 1.e+20

In this case, we process on the client, surely, as netCDF4-python deals with all cases.

@bnlawrence
Collaborator

Yes, we need to process on the client in all cases where the server can't handle it directly ...

@bnlawrence
Collaborator

What about error handling? What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

@davidhassell
Collaborator Author

davidhassell commented Oct 7, 2022

> What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

(Edited - sent prematurely)

I think that makes sense, as that also handles the case that all chunks are missing, for which the reduced answer is the mdi. That implies that the methods (like np.sum) should be their masked counterparts (like np.ma.sum).
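
A small sketch of that idea with NumPy masked arrays, assuming a single scalar fill value. The helper names `reduce_chunk` and `combine` are hypothetical and do not refer to existing functions in the project.

```python
import numpy as np

def reduce_chunk(chunk, fill_value, method=np.ma.min):
    """Reduce one chunk, ignoring values equal to `fill_value`.

    Returns np.ma.masked when every value in the chunk is missing.
    """
    chunk = np.asarray(chunk)
    masked = np.ma.masked_where(chunk == fill_value, chunk)
    return method(masked)

def combine(partials, method=np.ma.min):
    """Reduce the per-chunk results, skipping chunks that were all missing."""
    missing = [bool(np.ma.is_masked(p)) for p in partials]
    # Placeholder 0.0 for masked entries; it is hidden by the mask.
    values = [0.0 if m else float(p) for p, m in zip(partials, missing)]
    return method(np.ma.array(values, mask=missing))

chunks = [[-1e20, 280.0, 275.5], [-1e20, -1e20, -1e20]]
partials = [reduce_chunk(c, -1e20) for c in chunks]

result = combine(partials)                               # minimum over valid data
all_missing = combine([reduce_chunk(chunks[1], -1e20)])  # every chunk missing
```

Reducing the partial results with the same method works for min, max and sum; a mean would also need the count of valid values from each chunk.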

@bnlawrence
Collaborator

After today's conversation, we decided a reasonable option to avoid a potentially infinite length vector of "missing values", would be to support up to four numbers of missing information: valid_min, valid_max, missing_value, and _FillValue. If there was a vector of missing numbers in play, we'd simply default to "non-computational" storage.
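
One way this decision might look in code, as a sketch only: the function name `scalar_missing_info` is hypothetical, the attributes are assumed to arrive as a dictionary, and treating a length-one vector as a scalar is an assumption of this example. valid_range is not among the four numbers listed above, so it is left out here.

```python
SUPPORTED = ("valid_min", "valid_max", "missing_value", "_FillValue")

def scalar_missing_info(attrs):
    """Return up to four scalar missing-data numbers from `attrs`.

    Returns None if any of them is a vector with more than one element,
    signalling a fall back to "non-computational" storage.
    """
    info = {}
    for name in SUPPORTED:
        if name not in attrs:
            continue
        value = attrs[name]
        try:
            length = len(value)
        except TypeError:
            info[name] = value      # plain scalar, no length
            continue
        if length != 1:
            return None             # vector of missing values: fall back
        info[name] = value[0]       # assumption: unwrap length-one vectors
    return info

cmip6_like = scalar_missing_info({"_FillValue": -1e20, "missing_value": [-1e20]})
fallback = scalar_missing_info({"missing_value": [-1e20, 1e20]})
```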

@bnlawrence
Collaborator

@valeriupredoi Can you please look and see if we have access to those missing value attributes in the zarr dataset object itself? (i.e. will it be easy for us to pass them to _decode_chunk?)

@valeriupredoi
Collaborator

they are inside the bellows - see eg here but accessing and manipulating them from the API is a different dish of curry. I will investigate in more detail next week, ESMValX-releases permitting 👍

@bnlawrence
Collaborator

bnlawrence commented Oct 18, 2022

Argh, the interpretation for _FillValue is not as straightforward as you might think. See this issue, although I think the netcdf user guide has since been updated (and netcdf4-python no longer does that), so I don't think we want to replicate the use of _FillValue as a max or min ... but recording this here so we put something in the code so anyone falling over this in the future will be aware.

@valeriupredoi valeriupredoi added the data handling handling data label Oct 24, 2022
@bnlawrence bnlawrence modified the milestones: Post-Prototype, Prototype Oct 27, 2022
3 participants