Accounting for missing values in active storage operations. #18

davidhassell · 2022-10-04T10:17:27Z

No answers yet, just a statement of need.

Missing values need to be accounted for during active operations. For instance, a land-surface temperature minimum needs to ignore a missing_value of -1e20 over the oceans. Therefore the missing values (of which there can be 0 to many) need to be passed to the active storage, similarly to how the data type needs to be passed.

Things get complicated because there are many different ways of specifying missing values (https://docs.unidata.ucar.edu/nug/current/attribute_conventions.html), some of which are not simple numbers:

- The value of the _FillValue
- The value or values of the missing_value (which may be a scalar or vector)
- Any value strictly less than the valid_min number, or the first of the valid_range numbers
- Any value strictly greater than the valid_max number, or the second of the valid_range numbers

All of these methods are used in the wild.
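As a rough illustration of how the four conventions listed above combine into one mask, here is a minimal NumPy sketch. The function name `missing_mask` and its keyword arguments are hypothetical, and letting `valid_range` take precedence over `valid_min`/`valid_max` is an assumption made for the example, not something the conventions page prescribes.

```python
import numpy as np

def missing_mask(data, fill_value=None, missing_value=None,
                 valid_min=None, valid_max=None, valid_range=None):
    """Return a boolean array that is True where `data` counts as missing."""
    data = np.asarray(data)
    mask = np.zeros(data.shape, dtype=bool)

    # Fixed values: _FillValue plus the scalar or vector missing_value.
    fixed = []
    if fill_value is not None:
        fixed.append(fill_value)
    if missing_value is not None:
        fixed.extend(np.atleast_1d(missing_value).tolist())
    for value in fixed:
        # Compare in the dtype of the data so the match is exact.
        mask |= data == np.asarray(value, dtype=data.dtype)

    # Range limits (assumption: valid_range, if given, supplies both bounds).
    if valid_range is not None:
        valid_min, valid_max = valid_range
    if valid_min is not None:
        mask |= data < valid_min
    if valid_max is not None:
        mask |= data > valid_max
    return mask

# Land-surface temperatures with an ocean point and two out-of-range points.
temps = np.array([271.3, -1e20, 280.1, 999.0, 150.0], dtype=np.float32)
mask = missing_mask(temps, fill_value=-1e20, valid_range=(200.0, 350.0))
land_min = np.ma.masked_array(temps, mask).min()
```

Here only the first and third values survive the mask, so the minimum is taken over those two.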

The fixed missing values are typically floats which need to match exactly with values in the data, so a string decimal representation created by the client might not convert back to the exact binary representation on the storage. Does DAP deal with this, I wonder?
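To make the representation concern concrete, here is a small sketch comparing a short decimal string with a byte-exact encoding of a float32 missing value. How a real client and storage would actually exchange the value is not decided in this thread, so the two routes below are illustrative assumptions only.

```python
import struct
import numpy as np

fill = np.float32(1e20)  # missing value as stored in a float32 variable

# Lossy route: a short decimal string, parsed as float64 on the other side.
text = format(float(fill), "g")
parsed = float(text)
matches_as_float64 = parsed == float(fill)       # compared in double precision
matches_as_float32 = np.float32(parsed) == fill  # compared after casting back

# Exact route: ship the four raw bytes (little-endian) instead of text.
payload = struct.pack("<f", float(fill))
(restored,) = struct.unpack("<f", payload)
matches_bytes = np.float32(restored) == fill
```

With the string route the comparison only succeeds if the receiving side casts the parsed number to the data's own type before comparing; the byte route does not depend on that.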


bnlawrence · 2022-10-04T10:50:24Z

Oh bugger. I had forgotten about all the edge cases ...

I don't think DAP has to handle this, does it? They have the NetCDF file itself and the NetCDF semantics available server-side, whereas the active storage will not.

bnlawrence · 2022-10-04T10:54:53Z

This could be an 80/20 situation:

If we handle _FillValue and missing_value (scalar), do we think there are many cases where both might be present?
- Oh bugger, yes we do, since all cases of missing_value are likely to occur in the presence of a _FillValue, so we are always dealing with a vector of possible values to treat as missing.
If we handle min and max, do we think there are many cases where range might be present?

In the situation where we can't handle it, we default to normal storage operations of course ...

bnlawrence · 2022-10-04T10:55:45Z

We should at least force "normal" operations for now, if any of these are present in metadata.

davidhassell · 2022-10-04T11:11:39Z

Sounds like a good way forward. CMIP6 metadata mandates that you should use both _FillValue and missing_value and that they both should have the same value. This is of course not necessarily general practice elsewhere, but for model data I would have thought it is (almost) always the case.

Looking further ahead, providing a single number to the storage is probably no harder than providing "a few" but, as you say, no need to worry about that at this moment.

bnlawrence · 2022-10-04T11:15:14Z

I suggest we make a few dummy files by extending dummy data.py to explore the range of these possible missing value options, and that we introduce some code to detect them all ... and reject them for active storage processing for now. When we have that, we can start unpicking them one-by-one, starting with the CMIP6 use case.
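A detection step along these lines could look like the sketch below. It assumes the variable's attributes are available as a plain dict; the names `MISSING_ATTRS`, `missing_attrs_present` and `can_use_active_storage` are hypothetical and not part of any existing code.

```python
# Attribute names from the netCDF attribute conventions that mark missing data.
MISSING_ATTRS = ("_FillValue", "missing_value",
                 "valid_min", "valid_max", "valid_range")

def missing_attrs_present(attrs):
    """Return the missing-value attribute names found in `attrs`."""
    return [name for name in MISSING_ATTRS if name in attrs]

def can_use_active_storage(attrs):
    """For now, reject active storage as soon as any such attribute exists."""
    return not missing_attrs_present(attrs)

# Example attribute dicts, standing in for what the dummy files would carry.
plain = {"units": "K", "long_name": "temperature"}
cmip6_like = {"units": "K", "_FillValue": 1e20, "missing_value": 1e20}

use_active_plain = can_use_active_storage(plain)
use_active_cmip6 = can_use_active_storage(cmip6_like)
found = missing_attrs_present(cmip6_like)
```

Returning the list of names rather than a bare boolean makes it easier to relax the rule one attribute at a time later on.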

valeriupredoi · 2022-10-04T11:33:05Z

Very good question, David! I think the missing data value (whether it be _FillValue or missing_value) should be extracted from the file's metadata, as Bryan says, so we should check for either. If they are both present but the actual float values differ, then we choose 1.e+20 😁

davidhassell · 2022-10-04T12:12:49Z

> if they are both present but the actual float value differs then we choose 1.e+20

In this case, we process on the client, surely, as netCDF4-python deals with all cases.

bnlawrence · 2022-10-04T13:41:42Z

Yes, we need to process on the client in all cases where the server can't handle it directly ...

bnlawrence · 2022-10-07T10:51:17Z

What about error handling? What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

davidhassell · 2022-10-07T11:15:41Z

> What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

(Edited - sent prematurely)

I think that makes sense, as that also handles the case where all chunks are missing, for which the reduced answer is the mdi. That implies that the methods (like np.sum) should be replaced by their masked counterparts (like np.ma.sum).
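A minimal sketch of that behaviour, using a minimum as the reduction: each chunk is reduced with its masked counterpart, an all-missing chunk yields `np.ma.masked`, and the layer above combines the partial results. The function names `chunk_min` and `reduce_min` are hypothetical, and only a single fixed missing value is handled here.

```python
import numpy as np

def chunk_min(chunk, missing):
    """Masked minimum of one chunk; np.ma.masked if every value is missing."""
    return np.ma.masked_equal(chunk, missing).min()

def reduce_min(chunks, missing):
    """Combine per-chunk minima, skipping chunks that were entirely missing."""
    partials = [chunk_min(chunk, missing) for chunk in chunks]
    valid = [float(p) for p in partials if p is not np.ma.masked]
    # If every chunk was missing, the overall result is missing too.
    return min(valid) if valid else np.ma.masked

MDI = -1e20
chunks = [
    np.array([271.3, MDI]),
    np.array([MDI, MDI]),      # a chunk with no valid data at all
    np.array([265.0, 280.1]),
]
overall = reduce_min(chunks, MDI)
all_missing = reduce_min([np.array([MDI, MDI])], MDI)
```

The caller would then turn a final `np.ma.masked` back into the mdi when writing or returning the reduced value.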

bnlawrence · 2022-10-07T11:22:36Z

After today's conversation, we decided that a reasonable option, to avoid a potentially unbounded vector of "missing values", would be to support up to four numbers of missing information: valid_min, valid_max, missing_value, and _FillValue. If there were a vector of missing numbers in play, we'd simply default to "non-computational" storage.
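One way to express this policy in code is sketched below. The function `missing_spec` is hypothetical; it assumes the attributes arrive as a dict, and folding `valid_range` into `valid_min`/`valid_max` is an assumption of the example rather than something agreed in the conversation.

```python
import numpy as np

SCALAR_ATTRS = ("_FillValue", "missing_value", "valid_min", "valid_max")

def missing_spec(attrs):
    """Return at most four scalars describing missing data, or None.

    None means "fall back to non-computational storage".
    """
    spec = {}
    for name in SCALAR_ATTRS:
        if name in attrs:
            value = np.asarray(attrs[name])
            if value.size != 1:
                return None  # a vector of missing numbers is not supported
            spec[name] = value.item()
    # Assumption: a valid_range pair is treated as valid_min and valid_max.
    if "valid_range" in attrs:
        low, high = np.asarray(attrs["valid_range"]).tolist()
        spec.setdefault("valid_min", low)
        spec.setdefault("valid_max", high)
    return spec

cmip6_like = missing_spec({"_FillValue": 1e20, "missing_value": 1e20})
vector_case = missing_spec({"missing_value": [1e20, -999.0]})
range_case = missing_spec({"valid_range": [200.0, 350.0]})
```

An empty dict means there is nothing to mask, while `None` is reserved for the cases the storage cannot handle.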

bnlawrence · 2022-10-07T14:52:22Z

@valeriupredoi Can you please look and see if we have access to those missing value attributes in the zarr dataset object itself (i.e. will it be easy for us to pass them to _decode_chunk)?

valeriupredoi · 2022-10-07T14:59:05Z

They are inside the bellows (see e.g. here), but accessing and manipulating them from the API is a different dish of curry. I will investigate in more detail next week, ESMValX-releases permitting 👍

bnlawrence · 2022-10-18T11:09:30Z

Argh, the interpretation for _FillValue is not as straightforward as you might think. See this issue, although I think the netcdf user guide has since been updated (and netcdf4-python no longer does that), so I don't think we want to replicate the use of _FillValue as a max or min ... but I am recording this here so that we put something in the code and anyone falling over this in the future will be aware.

davidhassell added the enhancement New feature or request label Oct 4, 2022

davidhassell mentioned this issue Oct 4, 2022

We should push the missing value down to the chunks #19

Closed

bnlawrence assigned valeriupredoi Oct 4, 2022

bnlawrence added the excalibur Needs discussion by the excalibur team label Oct 4, 2022

bnlawrence mentioned this issue Oct 13, 2022

Missing and compression/filtering Issues #24

Merged

valeriupredoi added the data handling handling data label Oct 24, 2022

bnlawrence modified the milestones: Post-Prototype, Prototype Oct 27, 2022
