Harmonize nodata usage #375

johanvdw · 2024-04-24T09:38:15Z

There are too many different nodata definitions in use in niche:

- if the data is float32: np.nan is used
- if the data is uint8: 255 is used
- in all other cases: -99 is used

I think we should consider moving all data types to float32. Areas are relatively small, and we compress tif files anyway. This would also mean we can use np.nan for no data, which propagates properly without special tweaks.
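To illustrate the propagation point, here is a minimal sketch. The array values and variable names are made up for illustration and are not taken from the niche code base; only the -99 sentinel and np.nan come from the list above.

```python
import numpy as np

# Hypothetical 1-D "grids"; the values are made up for illustration.
NODATA_INT = -99
a_int = np.array([10, NODATA_INT, 30], dtype=np.int16)
b_int = np.array([1, 2, NODATA_INT], dtype=np.int16)

# With a sentinel value, plain arithmetic silently produces wrong cells ...
naive_sum = a_int + b_int

# ... so every operation needs an explicit mask to restore the sentinel.
mask = (a_int == NODATA_INT) | (b_int == NODATA_INT)
masked_sum = np.where(mask, NODATA_INT, a_int + b_int)

# With float32 and np.nan, no-data propagates through arithmetic by itself.
a_f = np.array([10, np.nan, 30], dtype=np.float32)
b_f = np.array([1, 2, np.nan], dtype=np.float32)
nan_sum = a_f + b_f

print(naive_sum, masked_sum, nan_sum)
```

Note that comparisons such as `a_f > 5` evaluate to False for NaN cells, so those cases would still need an `np.isnan` check.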

Masked arrays are another option for removing this low-level code, but fixing the data type is probably even better.

Note that the code currently seems to be working well, but it is rather complex internally, which has led to issues such as #335.

cecileherr · 2024-05-28T15:58:32Z

A side note: in the past (with niche 1.2, Windows 10, 8 GB RAM and less than 10 GB of free disk space) I ran into memory errors with some projects. For example, for a project with a resolution of 5 × 5 m:

MemoryError: unable to allocate 43.8 MiB for an array with shape (2760, 4160) and data type float32

I suspect changing the data type to float/solving issue #335 might lead to more memory problems (?). Would it be possible to test this/give an idea of the impact on memory/speed? Thx!
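As a rough back-of-the-envelope check, the in-memory size of a single grid can be computed from its shape and data type. The sketch below uses the shape from the error message above; the helper name `grid_mib` is hypothetical, and the figures cover only one array per grid, not any temporary copies made during a calculation.

```python
import numpy as np

shape = (2760, 4160)  # shape taken from the MemoryError message above


def grid_mib(shape, dtype):
    """Size in MiB of one dense 2-D array with the given shape and dtype."""
    n_cells = shape[0] * shape[1]
    return n_cells * np.dtype(dtype).itemsize / 2**20


for dtype in ("uint8", "int16", "float32", "float64"):
    print(f"{dtype:>8}: {grid_mib(shape, dtype):6.1f} MiB")
```

For float32 this reproduces the 43.8 MiB from the error message. A grid stored as uint8 takes a quarter of that, so converting uint8 grids to float32 would multiply their in-memory size by four.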

stijnvanhoey · 2024-10-01T19:06:00Z

I checked the use of masked arrays (see #387), but I will revert the masked-array implementation. Although masked arrays considerably simplify the implementation and the handling of no-data regions, the reasons for reverting are:

- https://numpy.org/doc/stable/reference/module_structure.html#legacy-namespaces mentions numpy.ma as a deprecated module (note that this is not mentioned in https://numpy.org/doc/stable/reference/maskedarray.html), which would require an overhaul.
- Others report that masked arrays are slower (e.g. "Use of masked arrays comes with huge performance drop", Unidata/netcdf4-python#809). I tested this myself on the current code and the decrease in speed is considerable, e.g.:
  - unit test "test_zwarte_beek": 2.9 s (median) for current master (no masked arrays) versus 7.7 s (median) for the masked-array approach
  - single niche model run in the getting-started notebook: 623 ms ± 118 ms for current master versus 1.93 s ± 506 ms for the masked-array approach

As this is more than double the run time, we no longer consider masked arrays a viable option and will focus on predefined data types (uint8 and float32) for each variable, each with a clear no-data value (255 for uint8 and np.nan for float32).
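A minimal sketch of what such a fixed mapping could look like. The names `NODATA` and `nodata_mask` are hypothetical and not taken from the niche code base; only the two data types and their no-data values come from the comment above.

```python
import numpy as np

# Hypothetical lookup table: one no-data value per supported dtype.
NODATA = {
    np.dtype("uint8"): 255,
    np.dtype("float32"): np.nan,
}


def nodata_mask(arr):
    """Return a boolean array that is True where `arr` holds no-data."""
    if arr.dtype not in NODATA:
        raise TypeError(f"unsupported grid dtype: {arr.dtype}")
    if np.issubdtype(arr.dtype, np.floating):
        # NaN never compares equal to itself, so use isnan instead of ==.
        return np.isnan(arr)
    return arr == NODATA[arr.dtype]


codes = np.array([1, 255, 7], dtype=np.uint8)
levels = np.array([0.5, np.nan, 2.0], dtype=np.float32)
print(nodata_mask(codes), nodata_mask(levels))
```

Rejecting any other dtype up front keeps the -99 convention from creeping back in unnoticed.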

johanvdw mentioned this issue Apr 24, 2024

Fix soilcode nan #371

Merged

stijnvanhoey mentioned this issue Oct 2, 2024

Harmonize data types and no-data handling in grids #387

Open
