CF-specification: improvements to the trajectory standard #123

knutfrode · 2024-10-09T13:46:52Z

knutfrode
Oct 9, 2024
Maintainer

erikvansebille · 2024-10-09T15:21:16Z

erikvansebille
Oct 9, 2024

[I can't edit the original post at the top myself, but here is some reflection from my side]

Introduction/rationale

As a community, we should be very happy that cfconventions.org provides a CF convention for trajectory data. However, the present convention (see here) may not be as suitable/flexible for our needs as we would hope. Hence, this discussion is meant to gather thoughts on what we would like to see in a CF convention for Lagrangian particle trajectory data, so that we can come to a proposal for the cfconventions team.

0 replies

erikvansebille · 2024-10-09T15:28:20Z

erikvansebille
Oct 9, 2024

I have two main issues with the present trajectories CF-convention:

The order of the dimensions (obs, trajectory) complicates simple plotting routines. As explained in "Reconsider the order of (trajectory, obs) in the zarr output file?" (OceanParcels/Parcels#1632), the simplest plotting script actually requires transposing (.T) the lat and lon arrays when one wants to plot trajectories. It would be much easier if the order of the dimensions were reversed.
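```
import matplotlib.pyplot as plt
import xarray as xr
ds = xr.open_zarr(filename)  # filename: path to the trajectory zarr store
plt.plot(ds.lon.T, ds.lat.T)  # transposed so that each trajectory is drawn as one line
```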
It is inconsistent that one dimension (obs) is an abbreviation of a word, while the other dimension (trajectory) is a full word. My preference would be to use obs and traj as dimensions, but I could also live with observation and trajectory.

0 replies

ChrisBarker-NOAA · 2024-10-11T16:04:14Z

ChrisBarker-NOAA
Oct 11, 2024

A few notes:

The order of the dimensions (obs, trajectory)

I don't think that the CF spec requires that they be in that order. For the most part, CF recommends an order, but doesn't require it. And I don't think the Trajectory spec even recommends one. Though I certainly learn best through examples, and all the examples have that order.

I think that the "trajectory" variable:

      char trajectory(trajectory, name_strlen) ;
          trajectory:cf_role = "trajectory_id";

is the specification for which dimension is which.

That being said, I don't think "easy to plot with MPL" should be the primary motivation for choosing dimension order. Dimension order can have a very large impact on performance when reading/writing data, and on working with data in memory. So CF usually recommends that the "fastest varying" dimension (the first one) be chosen to be the one that keeps data together that is likely to be used together -- e.g. for model results, time is usually the first dimension. I think that applies here, too.
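A small NumPy sketch of the memory-layout point. It assumes NumPy's default C (row-major) ordering, in which the last dimension is the one that varies fastest in memory. The array, its (trajectory, obs) order and the variable names are made up for illustration, and on-disk chunking in netCDF or Zarr adds another layer that this does not model.

```python
import numpy as np

# Sizes borrowed from the CF example dimensions; the values are dummies.
n_traj, n_obs = 77, 3443
lon = np.zeros((n_traj, n_obs), dtype=np.float64)

# Strides are the byte steps per dimension. The last dimension has the
# smallest step, so consecutive observations of one trajectory are adjacent.
strides = lon.strides

one_trajectory = lon[0, :]  # n_obs values stored next to each other
one_obs_index = lon[:, 0]   # n_traj values, each n_obs * 8 bytes apart

trajectory_is_contiguous = one_trajectory.flags["C_CONTIGUOUS"]
obs_slice_is_contiguous = one_obs_index.flags["C_CONTIGUOUS"]
```

With this layout, reading a whole trajectory touches one contiguous block, while reading all trajectories at one observation index jumps through memory. Swapping the dimension order swaps which of the two access patterns is cheap.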

It is inconsistent that one dimension (obs) is an abbreviation of a word, while the other dimension (trajectory) is a full word. My preference would be to use obs and traj as dimensions, but I could also live with observation and trajectory.

CF also has very little to say about what names you use [*] -- obs and trajectory are their examples, but you can use whatever names you want. This does make it a bit awkward to process an arbitrary file, but it is flexible, and once you write the code, it's not too bad:

Look for: cf_role = "trajectory_id"
Determine what dimension is its first dimension
- that dimension is the "trajectory" dimension

The "observation" dimension can be called whatever you want.
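The lookup above could be sketched roughly as follows. This is only an illustration: `find_trajectory_dim` is a hypothetical helper, not part of any library, and the stand-in "file" uses made-up dimension names (`drifter`, `step`). The function only needs a mapping of names to objects with `.dims` and `.attrs`, which is also what `ds.variables` of an xarray Dataset provides.

```python
from types import SimpleNamespace


def find_trajectory_dim(variables):
    """Return the dimension identified by cf_role = "trajectory_id".

    `variables` maps names to objects with `.dims` (tuple of dimension
    names) and `.attrs` (dict). Returns None if nothing is found.
    """
    for var in variables.values():
        if var.attrs.get("cf_role") == "trajectory_id":
            # A scalar id variable (single trajectory) has no dimensions.
            return var.dims[0] if var.dims else None
    return None


# Stand-in for a file that does not use the names from the CF examples.
variables = {
    "drifter": SimpleNamespace(dims=("drifter", "name_strlen"),
                               attrs={"cf_role": "trajectory_id"}),
    "lon": SimpleNamespace(dims=("drifter", "step"),
                           attrs={"standard_name": "longitude"}),
}

traj_dim = find_trajectory_dim(variables)
# The remaining dimension of a position variable is the observation dimension.
obs_dims = [d for d in variables["lon"].dims if d != traj_dim]
```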

More later ....

[*] -- the one exception I know of is that when a variable has the same name as a dimension, then it is a "coordinate" variable.

Which means that in the Trajectory format:

dimensions:
   obs = 3443 ;
   trajectory = 77 ;

variables:
   char trajectory(trajectory, name_strlen) ;
      trajectory:cf_role = "trajectory_id" ;

The trajectory variable is a coordinate variable.

But you could call them both "traj" and that would be perfectly valid.

0 replies

gauteh · 2024-10-12T06:57:11Z

gauteh
Oct 12, 2024
Maintainer

In trajan we need to detect the data-layout automatically. Having tested this on several models and different drifter datasets, I miss an unambiguous and definite way to know what the layout is. I would prefer the data-layout to be defined in an attribute, or even in a grid_mapping-like variable. Using the latter is kind of a hack that is adapted to netCDF, but it is used in other CF type datasets. It would allow the grid mapping variable to define the layout, the various coordinate variables, and which variables are positions (lon, lat). This would leave much less to chance.

A disadvantage of netCDF, Zarr, etc. is that they don't have a proper schema: it is very easy to create incorrectly defined datasets. Trajan already helps build CF-compliant drifter datasets, but we should perhaps consider constructing CF-compliant xarray datasets that are suitable for models as well? It would depend on how easy it is to make this general across models' needs and optimizations.

A few other points that came up when implementing this standard:

The multi-dimensional (obs, traj) format is the most complete representation; all formats can be converted to this one losslessly.
The multi-dimensional (obs, traj) has two subsets: 1) full (two trajectories never share an observation, similar to ragged), 2) condensed (all trajectories start at observation 0). The latter is much more compact. See https://github.com/OpenDrift/trajan/blob/main/trajan/traj.py#L751. The latter could in theory lose some information about observation order (obs is only a dimension, not a coordinate), but in practice this is probably never an issue.
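To make the difference between the two subsets concrete, here is a minimal NumPy sketch that turns a "full" array into a "condensed" one. It is not the trajan implementation linked above. The function name `condense`, the (trajectory, obs) array order and the use of NaN as the missing value are assumptions chosen for illustration.

```python
import numpy as np


def condense(full):
    """Shift the valid values of every trajectory to start at obs index 0.

    `full` has shape (trajectory, obs) with NaN where a trajectory has no
    observation. The result keeps only as many obs columns as the longest
    trajectory needs.
    """
    full = np.asarray(full, dtype=float)
    valid = ~np.isnan(full)
    n_obs = int(valid.sum(axis=1).max()) if full.size else 0
    out = np.full((full.shape[0], n_obs), np.nan)
    for i in range(full.shape[0]):
        row = full[i, valid[i]]  # valid values, original order preserved
        out[i, :row.size] = row
    return out


# Two trajectories that never share an observation index ("full" layout).
full = np.array([[1.0, 2.0, np.nan, np.nan],
                 [np.nan, np.nan, 3.0, 4.0]])
condensed = condense(full)
```

The condensed array here needs two obs columns instead of four. The original obs indices are gone after the shift, which is the information loss mentioned above unless a per-observation time variable is stored alongside.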

My plan for trajan is to have a normalize method to add necessary properties and add the standard variable attributes so that a dataset is compliant. I think trajan would be pretty useless if we didn't try to parse existing non-fully compliant datasets. I also think that making tools to verify the compliance is less useful than making tools that help others (or ourselves) build compliant datasets directly (see a few paragraphs up).

0 replies

ChrisBarker-NOAA · 2024-10-25T20:27:09Z

ChrisBarker-NOAA
Oct 25, 2024

Sorry I haven't had time to reply earlier, and I still don't have time to do it justice, but a couple quick notes:

In trajan we need to detect the data-layout automatically

Indeed, we all want that!

I miss an unambiguous and definite way to know what the layout is. I would prefer the data-layout to be defined in an attribute, or even in a grid_mapping-like variable. Using the latter is kind of a hack that is adapted to netCDF, but it is used in other CF type datasets. It would allow the grid mapping variable to define the layout, the various coordinate variables, and which variables are positions (lon, lat). This would leave much less to chance.

That is the whole goal of CF -- and I think it's there, if you want to / can use the existing Trajectory specification.

It would be easier with a full grid_mapping, but it should be doable with:

trajectory:cf_role = "trajectory_id";

and all the other ways to determine coordinates, standard names, etc.

If not, then CF needs a new feature -- and we should propose one.

As far as software auto-detection goes:

It's a reality that folks don't always do CF, or do it right, so my recommended approach is one of:

First look for the CF-compliant way to define everything.
Only if that fails, do whatever hacks you need to infer what's there (common variable names, etc.) -- ideally warning the user that you guessed.

(that's what we do in gridded and PyGNOME for gridded model results)

or:

First process the file to make it CF compliant -- have the user specify what's what, and add the metadata, and/or infer what you can, and add the metadata.
Then the rest of your code can assume that you're working with CF compliant data.

That's what's being done in the xarray_subset_grid project.

Now that I've got a lot of experience with this, I recommend the second option -- it's much easier to be clear to your users -- it fails early, not buried in the depths of the code, and it simplifies the core code -- you can assume everything is as it should be.
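A rough sketch of what the second option can look like. The names `normalize` and `trajectory_dim` are hypothetical and are not the API of trajan, xarray_subset_grid or any other project mentioned here. The stand-in objects only mimic the `.dims` and `.attrs` of real variables.

```python
from types import SimpleNamespace


def normalize(variables, trajectory_id_var=None):
    """Make sure exactly one variable carries cf_role = "trajectory_id".

    Compliant input is returned unchanged. Otherwise the caller has to
    name the id variable; the attribute is then added in place. Failing
    here keeps the error close to the input file.
    """
    tagged = [n for n, v in variables.items()
              if v.attrs.get("cf_role") == "trajectory_id"]
    if len(tagged) == 1:
        return variables
    if len(tagged) > 1:
        raise ValueError(f"several trajectory_id variables: {tagged}")
    if trajectory_id_var is None:
        raise ValueError("no cf_role='trajectory_id' found; "
                         "pass trajectory_id_var explicitly")
    if trajectory_id_var not in variables:
        raise KeyError(trajectory_id_var)
    variables[trajectory_id_var].attrs["cf_role"] = "trajectory_id"
    return variables


def trajectory_dim(variables):
    """Core code: assumes normalize() has run, so no guessing is needed."""
    (name,) = [n for n, v in variables.items()
               if v.attrs.get("cf_role") == "trajectory_id"]
    return variables[name].dims[0]


# Non-compliant stand-in: the id variable lacks its cf_role attribute.
raw = {
    "id": SimpleNamespace(dims=("traj",), attrs={}),
    "lat": SimpleNamespace(dims=("traj", "obs"), attrs={}),
}

try:
    normalize(raw)
    failed_early = False
except ValueError:
    failed_early = True

dim = trajectory_dim(normalize(raw, trajectory_id_var="id"))
```

The core function contains no fallbacks at all; every decision about what is what happens once, in `normalize`, where the user can still be asked.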

My plan for trajan is to have a normalize method to add necessary properties and add the standard variable attributes so that a dataset is compliant.

That sounds like my second option -- I agree that that's a good way to go.

The multi-dimensional (obs, traj) has two subsets: 1) full (two trajectories never share an observation, similar to ragged), 2) condensed (all trajectories start at observation 0). The latter is much more compact.

This is why I think we do need an extension to CF -- that way we could get the lossless conversion to a full array, and also a more compact and efficient storage layout for the common case of results from particle tracking models.

We have been doing that for years with PyGNOME.

I have recently started an xarray implementation for working with this format. It's not quite complete, but the idea is that you can either:

work with the ragged storage directly, and have it "look" like a full array
or
call the to_full_array method, and get a regular old 2-D array with a _FillValue set for the missing data.
(There is also a to_ragged method that converts from a regular 2-D array for writing to file.)

I think trajan could use (2) sooner rather than later to easily support these files.
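For readers unfamiliar with the idea, here is a minimal NumPy sketch of converting between a ragged layout and a full 2-D array. It is not the nc_particles code: the function names `ragged_to_full` and `full_to_ragged` are hypothetical, and it assumes a contiguous ragged layout with one flat data array plus a count of observations per trajectory, with NaN as the fill value.

```python
import numpy as np


def ragged_to_full(values, row_sizes, fill_value=np.nan):
    """Expand flat `values` into a (trajectory, obs) array padded with fill_value."""
    values = np.asarray(values, dtype=float)
    row_sizes = np.asarray(row_sizes, dtype=int)
    if row_sizes.sum() != values.size:
        raise ValueError("row_sizes do not add up to the number of values")
    n_obs = int(row_sizes.max()) if row_sizes.size else 0
    full = np.full((row_sizes.size, n_obs), fill_value, dtype=float)
    start = 0
    for i, n in enumerate(row_sizes):
        full[i, :n] = values[start:start + n]
        start += n
    return full


def full_to_ragged(full, fill_value=np.nan):
    """Inverse operation; assumes missing values only pad the end of each row."""
    full = np.asarray(full, dtype=float)
    valid = ~np.isnan(full) if np.isnan(fill_value) else full != fill_value
    # Boolean indexing walks the array row by row, so values stay grouped
    # by trajectory in their original order.
    return full[valid], valid.sum(axis=1)


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
row_sizes = [3, 1, 2]  # three trajectories of different length

full = ragged_to_full(values, row_sizes)
values_back, sizes_back = full_to_ragged(full)
```

Here six stored values become a 3 x 3 array with three padded cells. The padding overhead grows with the spread in trajectory lengths, which is where the ragged storage pays off.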

Code here -- not complete, but I'd love help:

https://github.com/NOAA-ORR-ERD/nc_particles/tree/new_code

(look at the new_code branch).

0 replies

knutfrode · 2024-11-14T12:48:10Z

knutfrode
Nov 14, 2024
Maintainer Author

Thank you very much for these inputs, @ChrisBarker-NOAA !

Yes, you must be right that the dimensions are not supposed to have fixed names (e.g. obs and trajectory as in the CF-examples), but should instead be detected dynamically as you describe.
Thus we have now implemented such detection in TrajAn, e.g. with this method to detect the trajectory dimension:

trajan/trajan/accessor.py

Line 25 in b75beb7

def detect_trajectory_dim(ds):

I see that you say that the trajectory-dimension is always the first dimension of the variable with cf_role=trajectory_id, thus this method can be simplified a little more.

The trajectory dimension is now stored by TrajAn as ds.traj.trajectory_dim,
and the time/obs dimension is stored as ds.traj.obs_dim

We have also made a new command-line utility to inspect the trajectory information: $ trajaninfo <file/URL>,
which gives the same output as >>> print(ds.traj),
as demonstrated on a Parcels Zarr dataset here: https://opendrift.github.io/trajan/gallery/example_parcels.html#sphx-glr-gallery-example-parcels-py

0 replies
