String handling #111

Open
znichollscr opened this issue Feb 6, 2025 · 3 comments

Comments

@znichollscr

At the outset, I'm not sure whether this is a user error, a bug in ncdata, a bug upstream in xarray/iris, or a bug downstream in netCDF4. I'm asking here because it occurs in ncdata, but feel free to send me looking elsewhere if that makes more sense.

I was playing around with writing strings into a netCDF file. There seem to be multiple ways to do this: some work fine, others raise errors.

For running all these demos, I used a Python 3.11 virtual environment with the following requirements.txt file. I'm working on a Mac.

Requirements

ncdata==0.1.1
netCDF4==1.7.2
scitools-iris==3.11.1
xarray==2025.1.2

Passing example

If you create the array as a character array, everything seems to be happy:

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"][:] = netCDF4.stringtochar(regions)


# None of these raise any errors
from_nc4("demo.nc")
cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)

The output netCDF file also looks sensible

ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
data:

 region =
  "Australia",
  "New Zealand",
  "England" ;
}
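For what it's worth, the round trip in the passing example can be mimicked with plain numpy views. This is only a sketch of what `netCDF4.stringtochar`/`chartostring` effectively do to the data layout, not their actual implementation:

```python
import numpy as np

regions_l = ["Australia", "New Zealand", "England"]
n = max(len(v) for v in regions_l)  # 11

# Fixed-width bytes array, one entry per string
regions = np.array(regions_l, dtype=f"S{n}")

# Roughly what netCDF4.stringtochar does: view each fixed-width
# string as a row of single characters, shape (lbl, strlen)
chars = regions.view("S1").reshape(len(regions_l), n)
assert chars.shape == (3, 11)

# And the inverse (chartostring): join the characters back up
roundtrip = chars.reshape(-1).view(f"S{n}")
assert roundtrip.tolist() == regions.tolist()
```

The `(lbl, strlen)` char layout is exactly what ends up in the file, which is why all four loaders cope with it.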

Failing example 1 - something to do with encoding

If you create the array as a character array but let netCDF4 do the encoding, the string decoding seems not to work if you load with iris and then try to convert with ncdata (which suggests the bug is in iris?).

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype=f"S{regions_max_length}")
    ds.createDimension("lbl", len(regions))
    ds.createDimension("strlen", regions_max_length)
    ds.createVariable("region", "S1", ("lbl", "strlen"))
    ds["region"]._Encoding = "ascii"
    ds["region"][:] = regions


from_nc4("demo.nc")

cube = iris.load("demo.nc")
from_iris(cube)
"""
The line above gives the following error

...

  File ".../venv/lib/python3.11/site-packages/ncdata/dataset_like.py", line 284, in _get_fillvalue
    fv = netCDF4.default_fillvals[dtype_code]
         ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'U11'
"""
cubes_to_xarray(cube)

The underlying netCDF file looks sensible though.

netcdf demo {
dimensions:
	lbl = 3 ;
	strlen = 11 ;
variables:
	char region(lbl, strlen) ;
		region:_Encoding = "ascii" ;
data:

 region =
  "Australia",
  "New Zealand",
  "England" ;
}
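My reading of the `KeyError: 'U11'` (my own illustration, not ncdata's actual code): because of `_Encoding`, iris hands ncdata a decoded numpy unicode array rather than raw chars, and the dtype code derived from it has no entry in `netCDF4.default_fillvals`, which is keyed by netCDF type codes like `"S1"`, `"i4"`, `"f8"`:

```python
import numpy as np

# Iris hands ncdata a decoded unicode array rather than raw chars
decoded = np.array(["Australia", "New Zealand", "England"])
assert decoded.dtype == np.dtype("<U11")

# A dtype code like the one in the traceback: kind + length,
# stripped of the byte-order prefix
dtype_code = decoded.dtype.str.lstrip("<>=|")
assert dtype_code == "U11"

# "U11" is not a netCDF type, so a lookup in a table keyed by
# netCDF type codes ("S1", "i4", "f8", ...) raises KeyError
```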

Failing example 2 - variable length strings

If you write using a variable-length string, the error appears to come from ncdata. However, iris also can't load the file, so maybe this just isn't a supported use case.

import iris
import netCDF4
import numpy as np
from ncdata.iris import from_iris
from ncdata.iris_xarray import cubes_to_xarray
from ncdata.netcdf4 import from_nc4

iris.FUTURE.save_split_attrs = True

with netCDF4.Dataset("demo.nc", "w") as ds:
    regions_l = ["Australia", "New Zealand", "England"]
    regions_max_length = max(len(v) for v in regions_l)

    regions = np.array(regions_l, dtype="O")
    ds.createDimension("lbl", len(regions))
    ds.createVariable("region", str, ("lbl",))
    ds["region"][:] = regions


from_nc4("demo.nc")
"""
The line above gives the following error

Traceback (most recent call last):
  File ".../demo-variable-str-failing.py", line 20, in <module>
    from_nc4("demo.nc")
  File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 308, in from_nc4
    ncdata = _from_nc4_group(nc4ds)
             ^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/ncdata/netcdf4.py", line 264, in _from_nc4_group
    var.data = da.from_array(
               ^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3523, in from_array
    chunks = normalize_chunks(
             ^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3130, in normalize_chunks
    chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.11/site-packages/dask/array/core.py", line 3304, in auto_chunks
    raise ValueError(
ValueError: auto-chunking with dtype.itemsize == 0 is not supported, please pass in `chunks` explicitly
"""

cube = iris.load("demo.nc")
from_iris(cube)
cubes_to_xarray(cube)

The underlying netCDF file seems to be valid, but maybe I'm missing something.

ncdump demo.nc
netcdf demo {
dimensions:
	lbl = 3 ;
variables:
	string region(lbl) ;
data:

 region = "Australia", "New Zealand", "England" ;
}
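My guess at why dask balks here (an illustration, assuming the netCDF4 variable-length string variable reports its dtype as Python `str`): numpy maps `str` to a zero-width unicode dtype, and dask's auto-chunking divides a memory budget by `dtype.itemsize`, which cannot work when that is zero.

```python
import numpy as np

# A variable-length string variable has no fixed element width;
# numpy's dtype for bare `str` is a unicode dtype of zero width:
vlen_dtype = np.dtype(str)
assert vlen_dtype.kind == "U"
assert vlen_dtype.itemsize == 0

# chunks="auto" picks a chunk shape by dividing a byte budget by
# dtype.itemsize, so a zero item size is unusable -- hence the
# ValueError, and hence its advice to pass `chunks` explicitly.
```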

@pp-mo not sure if you have any thoughts?

@pp-mo
Owner

pp-mo commented Feb 7, 2025

@znichollscr thanks so much for getting in touch!
And especially for your simple + direct code examples.

I absolutely do have thoughts (!)

Unfortunately, it's rather complicated.
There are quite a few "layers" involved here, and as a developer of both Iris and ncdata I can only apologize for shortcomings in both!
I hope ncdata can be forgiven somewhat as it is still pretty new, but the key problems in Iris here are rather longstanding.

I'll just quickly list the key problems I think are involved here, and afterward relate them to what I think is going on ...

  • regarding your example (1), Iris support for string variables (i.e. cubes) is not correct
    • I believe that it does not load "normal" string variables (i.e. the CF-sanctioned character-array type) correctly
    • it does not save them correctly either (!)
  • it also looks like ncdata is not able to handle types of data array it didn't expect to get from Iris (!)
  • regarding your example (2) mostly (but also affecting 1), the netCDF4 variable-length ("string") data type causes problems:
    • iris can't load this type of variable
    • (somewhat by the way, xarray can load it, but not with chunks="auto", because dask won't support it)
    • all this is due to problems choosing a chunk size for the "unknown" item size of a variable-length type ...
    • ... so, ncdata can't load this either,
      but there is a new feature coming in v0.2 which provides a workaround --
      please see the account in the forthcoming new documentation,
      e.g. here (sorry, these links may not be stable -- we are still working on this PR! and NB I linked to the method because the link from the how-to back to the method is broken 😩)
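To make the item-size problem above concrete, here is one hand-rolled way around it (my own sketch, not the v0.2 workaround): materialise the variable-length strings and cast to a fixed-width dtype, which gives dask a real item size to chunk by.

```python
import numpy as np

# Variable-length strings come back as an object array
vlen = np.array(["Australia", "New Zealand", "England"], dtype=object)

# Casting to "U" lets numpy pick the maximum fixed width...
fixed = vlen.astype("U")
assert fixed.dtype == np.dtype("<U11")

# ...which restores a nonzero itemsize (4 bytes per UCS-4 char),
# so auto-chunking has something to divide by
assert fixed.dtype.itemsize == 44
```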

@pp-mo
Owner

pp-mo commented Feb 7, 2025

For reference, I'm hoping to get the v0.2 release out in the next week or so.
And Iris release v3.12 is also due by the end of this month.

Sadly I'm not sure we can promise to squeeze fixes to string data handling into this release, but we might take a look at it.

I don't have more time this morning to explain the details of the problems you are reporting (though I certainly will do soon), but it would be useful to know what you really need to achieve here

-- is ncdata really important, or do you just need to fix it so you can handle this type of data with Iris?

@znichollscr
Author

Hi @pp-mo, thanks for the great reply.

is ncdata really important, or do you just need to fix it so you can handle this type of data with Iris?

Good question. For the very immediate future, the answer is: I don't need this at all, I just worked around it. However, given that I'd noticed it, I figured I would report it.

The back story of what I'm actually trying to do is below, if it's of interest.

Back story

I'm helping out with getting all the forcings data published for the CMIP7 AR7 fast-track. As I go along, I'm trying to put issues in helpful places to generally help out the ecosystem (e.g. this issue came out of trying to get a dust dataset, which includes a region dimension, into the right format for the ESGF: PCMDI/input4MIPs_CVs#140).

In practice, really what I'm trying to do is work out a sane way to write CF-compliant, CMIP-controlled-vocabulary-compliant netCDF files. When I started https://github.com/climate-resource/input4mips_validation, the simplest way was to use iris. However, in general I find it way easier to work with xarray. In trying to work out how to convert from xarray to iris, I stumbled upon ncdata.

I then learnt about the existence of https://github.com/NCAS-CMS/cf-python. That makes it pretty obvious: if you want to write CF-compliant files, use the package which implements the spec.

Having said that, I didn't want to convert my entire stack to being cf-python based. (I think that's also the genius of ncdata, you don't have to choose, you just work in whatever format is best for you then convert at the end.) So, my next thought was, help out with the ncdata cf-python converter, which is why I popped up here: #95 (although, as you can tell, I've then had no time since).

The issue probably boils down to this: the only package that is tightly coupled to the netCDF API is netCDF4. All the other packages (except maybe ncdata) are data containers, so it seems like they trade off complete fidelity to what is in the underlying netCDF file against usability. However, I don't know the netCDF spec well enough to know whether that's actually the case, or whether there is a package other than netCDF4 which will just give you a view into a netCDF file without doing helpful conversions along the way (again, maybe that's the goal of ncdata?). In the meantime, I'm just bumbling my way through, working around quirks as I need to and doing my best to report things where I see them without creating a ton of noise for everyone else.
