
Backward compatibility of raw-converted dataset after xarray.DataTree migration #1420

Open
leewujung opened this issue Dec 27, 2024 · 12 comments
@leewujung
Member

leewujung commented Dec 27, 2024

Context

Part of the migration from xarray-contrib/datatree to xarray.DataTree requires strict inheritance of coordinates:
https://github.com/pydata/xarray/blob/main/DATATREE_MIGRATION_GUIDE.md#data-model-changes

There should be just 1 case of this in the current echopype-adapted SONAR-netCDF4 structure, under the Platform/NMEA group, since the Platform/time1 coordinate is a subset of the Platform/NMEA/time1 coordinate -- thanks @ctuguinay for finding this out!

One workaround is to remove the indexes associated with the coordinate, but that may introduce confusion down the road, so we should probably go with renaming the Platform/NMEA/time1 coordinate (in the work in #1419).

Issue

However, once this is fixed, we have a larger problem: previously raw-converted zarr datasets cannot be opened by the new xarray.open_datatree function.

Tasks

  • Use xarray.open_groups to circumvent the inheritance problem so that users can open previously stored raw-converted data (see the sketch after this list)
  • Create a mechanism for users to optionally convert their data to the new version
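
A minimal sketch of that first task (not echopype's final implementation; the store path is hypothetical and the rename target is illustrative only, since the actual name is settled later in this thread):

import xarray as xr

# Open every group independently; this bypasses the strict coordinate
# inheritance check that xr.open_datatree applies to the whole tree.
groups = xr.open_groups("old_raw_converted.zarr", engine="zarr")

# Rename the child coordinate that clashes with the inherited Platform/time1
# before reassembling the hierarchy.
nmea = groups["/Platform/NMEA"]
if "time1" in nmea.coords:
    groups["/Platform/NMEA"] = nmea.rename({"time1": "nmea_time"})

# Rebuild the tree; parent and child coordinates are now consistent.
tree = xr.DataTree.from_dict(groups)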
@oftfrfbf
Contributor

I will be taking a look at this issue. I am currently getting up to speed with xarray.DataTree and its use throughout the project. I'll start by getting caught up on the tickets leading up to this one and then circle back here to gather more requirements.

I am thinking it would be nice to know which range of echopype versions will need to be upgradable, e.g. v7.0 to v9.0 or v8.0 to v9.0, and if there is any specific test data you would like me to work with.

oftfrfbf self-assigned this Dec 29, 2024
@leewujung
Member Author

Thanks @oftfrfbf !

I am thinking it would be nice to know which range of echopype versions will need to be upgradable, e.g. v7.0 to v9.0 or v8.0 to v9.0, and if there is any specific test data you would like me to work with.

I don't know exactly who has large volumes of data converted using previous echopype versions, but my guess would be that the current format (the one v0.9.0 generates) is the most widely used. To simplify things and also so that we don't get swamped by data versioning just yet, I think we can focus on something that can take v0.9.X-generated data (which uses the older xarray-contrib/datatree) and open it with echopype >=v0.10.0 (which uses the new xarray.DataTree).

I'll take a look at when the last format change was so that we can put an accurate statement in the documentation.

@oftfrfbf
Contributor

oftfrfbf commented Jan 21, 2025

An update, I am making progress here:

https://github.com/oftfrfbf/echopype/tree/EP-1420

with opening the data via open_groups and converting with DataTree.from_dict(). I have a couple different test resources across major versions of echopype that I am trying to ensure work properly. It has taken some time to get things set up but now I'm troubleshooting the changes needed to the time1/nmea_time naming. I'll continue to update here with progress.

"Create a mechanism for users to optionally convert their data to the new version" ...I would presume that once the capability to open the older v0.9.0 DataTree files is enabled, users can just open the file and save same as normal w >=v0.9.2 — as long as it overwrites the old zarr store it will be in the proper convention?!

@leewujung
Member Author

"Create a mechanism for users to optionally convert their data to the new version" ...I would presume that once the capability to open the older v0.9.0 DataTree files is enabled, users can just open the file and save same as normal w >=v0.9.2 — as long as it overwrites the old zarr store it will be in the proper convention?!

Yes that's what I was thinking!

@leewujung
Member Author

Hey @oftfrfbf - Just wanted to let you know that @ctuguinay and I resolved quite a few bugs/warnings (though zarr<3 and netcdf<1.7.1 are still pinned), so please pull from main to get those updates in your fork/branch.

Do you think you may be able to finish this some time this week? With all the dependency lags, we really need to release asap: at this moment a fresh environment with echopype will error out at import (see #1432) and produce various errors, which is very serious. But we can't release without the open_converted capability to open previously stored zarr files (this issue). If you see a workaround to sidestep this, let me know.

@oftfrfbf
Contributor

oftfrfbf commented Feb 4, 2025

Hi @leewujung, sorry this is taking so long (political events in the federal & university world aren't helping with all the distractions). I have put together some test files, created by deploying echopype to varying test environments and generating zarr stores from a test file (only focused on >=v0.9.0):

[image: listing of the generated test zarr stores]

And I have this code so far (sorry it's messy):

echodata.py in the "from_file" method:

if XARRAY_ENGINE_MAP[suffix] == "zarr":  # testing for legacy datatree
    # tree = open_datatree(  # open_datatree is for data written with the new datatree
    tree = open_groups(  # open_groups is for older, obsolete data
        converted_raw_path,
        engine=XARRAY_ENGINE_MAP[suffix],
        **echodata.open_kwargs,
    )  # open_groups returns '/', '/Environment', '/Platform', etc.
    # newer 0.9.1 only has '/'; older has '/' + '/Platform' + a bunch of others
    if (
        "/Platform/NMEA" in tree
    ):  # for v0.9.0.zarr there is only a root directory (when calibrate was done)
        platform_nmea = tree["/Platform/NMEA"]
        if "time1" in platform_nmea.coords:
            # rename the conflicting coordinate so the tree can be rebuilt
            platform_nmea = platform_nmea.rename({"time1": "time_nmea"})
            print(platform_nmea)
            tree["/Platform/NMEA"] = platform_nmea
        tree = xr.DataTree.from_dict(tree)  # rebuild after the 'time1' rename above
        tree.name = "root"
        print(tree)
    elif "/" in tree:  # this is the calibrate.compute_Sv() situation that fails
        tree = xr.DataTree.from_dict(tree)
        tree.name = "root"
    else:
        tree = open_datatree(  # fall back to open_datatree for data in the new format
            converted_raw_path,
            engine=XARRAY_ENGINE_MAP[suffix],
            **echodata.open_kwargs,
        )
        tree.name = "root"
along with a test in test_echodata.py. Hopefully "time_nmea" is the correct variable name for the "time1" data?

I can open converted zarr files from v0.9.2 and greater without problems. I can also open converted zarr files from v0.9.0 and v0.9.1, do the datatree conversion, and everything ends up as expected. That said, if the v0.9.0-or-v0.9.1 echodata object had a method applied to it before saving, e.g. "calibrate.compute_Sv", then the structure is a little different and I am trying to figure out the proper procedure to convert the data in that situation. I think it's something simple like redefining the root node; I am getting closer but have yet to figure it out completely.

I am also a little unsure of how to proceed with opening the file. Right now I try to open any zarr file with "open_groups" first, and if that doesn't return the expected legacy formatting I proceed to open it with "open_datatree" ...that seems really inefficient to me, opening the file potentially twice in a row? Without adding additional parameters to control the flow, I am not sure how to determine whether the file is in a legacy format without opening it first. It just doesn't seem like the right way to do things.

I can devote the rest of the week to getting this running properly. I am wondering if it looks like I am headed in the right direction though? Any thoughts?

@leewujung
Member Author

leewujung commented Feb 4, 2025

Thanks @oftfrfbf for the updates! Sorry for the impacts of politics...

Hopefully "time_nmea" is the correct variable name for the "time1" data?

@ctuguinay and I had a discussion on this and settled on nmea_time, since it is the timestamp of the NMEA datagrams, just as ping_time is the timestamp of the pings. I think we'll need a thorough review of the time coordinate naming and inheritance issue some time soon -- nmea_time may not be the only coordinate running into the inheritance issue -- so that we can update the data structure accordingly (which means breaking changes 😞).

I can open converted zarr files from v0.9.2 and greater without problems. I can also open converted zarr files from v0.9.0 and v0.9.1, do the datatree conversion, and everything ends up as expected. That said, if the v0.9.0-or-v0.9.1 echodata object had a method applied to it before saving, e.g. "calibrate.compute_Sv", then the structure is a little different and I am trying to figure out the proper procedure to convert the data in that situation. I think it's something simple like redefining the root node; I am getting closer but have yet to figure it out completely.

I am a little confused here - I thought this has more to do with whether the data could be opened with aligned inherited coordinates (see here)? But I could be wrong.

I am also puzzled as to why performing a method on an Echodata object would change its structure. Unless the method modifies anything in place in memory? (open_groups opens with read-only access) What are the things you saw that got changed?

I am also a little unsure of how to proceed with opening the file. Right now I try to open any zarr file with "open_groups" first, and if that doesn't return the expected legacy formatting I proceed to open it with "open_datatree" ...that seems really inefficient to me, opening the file potentially twice in a row? Without adding additional parameters to control the flow, I am not sure how to determine whether the file is in a legacy format without opening it first. It just doesn't seem like the right way to do things.

I agree that this could be inefficient. A few thoughts below:

  • if we force open_groups to only lazy-load data, it would at least be pretty fast for most files?
  • I wonder if there's a place in the saved zarr files, like a metadata entry that tells us what that is.

Let me know if it's helpful to hop on a quick call. This is so messy!

@oftfrfbf
Contributor

oftfrfbf commented Feb 5, 2025

Thanks for the help @leewujung,

I am also puzzled as to why performing a method on an Echodata object would change its structure. Unless the method modifies anything in place in memory? What are the things you saw that got changed?

Sorry, I think I'm making this more confusing than it needs to be. I guess the distinction here is between opening and saving an EchoData object versus doing something like calibrate.compute_Sv(echodata) and saving that. The former saves the EchoData object; the latter ends up as an xarray Dataset. My confusion was whether the data ("ds_Sv" saved in a Zarr store, for example) would need to be opened with echopype. Thinking this through again, my intuition is now telling me no; users would just open it with zarr or xarray.

[image attachment]

...

if we force open_groups to only lazy-load data, it would at least be pretty fast for most files?

Yes, I am not seeing any noticeable overhead from opening the file twice with the method I gave previously. I think the only complication that could arise would be if a chunking scheme is undefined by the Zarr engine when opening the file with dask in the background. I don't think this is something we need to take into consideration. I will opt to keep things as they are then.

I wonder if there's a place in the saved zarr files, like a metadata entry that tells us what that is.

Nothing obvious is jumping out to me in the .zmetadata as I look through it.

I think that I have what I need to move forward. I'll pull your latest changes from the main branch and put things together in a pull request.

@leewujung
Member Author

Sorry, I think I'm making this more confusing than it needs to be. I guess the distinction here is between opening and saving an EchoData object versus doing something like calibrate.compute_Sv(echodata) and saving that. The former saves the EchoData object; the latter ends up as an xarray Dataset. My confusion was whether the data ("ds_Sv" saved in a Zarr store, for example) would need to be opened with echopype. Thinking this through again, my intuition is now telling me no; users would just open it with zarr or xarray.

Yep, the result of compute_Sv is a standard xarray dataset without any hierarchical structure (i.e. a "flat" dataset), so there should be no compatibility issues.
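
For example (the path here is hypothetical), such a saved Sv output opens as a plain dataset without any DataTree handling:

import xarray as xr

# compute_Sv output saved to zarr is a flat dataset; open_dataset is all that's needed
ds_Sv = xr.open_dataset("Sv_output.zarr", engine="zarr")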

Yes, I am not seeing any noticeable overhead from opening the file twice with the method I gave previously. I think the only complication that could arise would be if a chunking scheme is undefined by the Zarr engine when opening the file with dask in the background. I don't think this is something we need to take into consideration. I will opt to keep things as they are then.

If there's slowdown it will likely happen with larger files, like with a combined EchoData object. Most of the EK files are pretty small so the difference may not be noticeable. This is likely ok as we go through the transition. If someone has very large files and can't handle it, hopefully we'll hear about it.

@oftfrfbf
Contributor

oftfrfbf commented Feb 5, 2025

If there's slowdown it will likely happen with larger files, like with a combined EchoData object. Most of the EK files are pretty small so the difference may not be noticeable. This is likely ok as we go through the transition. If someone has very large files and can't handle it, hopefully we'll hear about it.

The only other alternative I can really think of is to parse the provenance "conversion_software_version" in the root .zmetadata for the semantic version and use that as a conditional: for <0.9.2 use "open_groups", else use "open_datatree".

@leewujung
Member Author

The only other alternative I can really think of is to parse the provenance "conversion_software_version" in the root .zmetadata for the semantic version and use that as a conditional: for <0.9.2 use "open_groups", else use "open_datatree".

I was thinking about this too yesterday -- in the Provenance group there's a conversion_software_version, so one can get it from there and doesn't have to go to zmetadata. I ditched that idea, because it assumes that for echopype<0.9.2 the xarray version used always predates xr.DataTree (since ultimately it is xarray that writes the zarr file, not echopype). This is probably true, since the xarray.DataTree migration was relatively recent, but unless we can explicitly check the xarray version used to write the file, there's some risk here -- but actually, this is probably written in zmetadata?
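
A rough sketch of that version check (the store path is hypothetical, and it assumes conversion_software_version is stored in the Provenance group's attributes):

import xarray as xr
from packaging.version import Version

store = "old_raw_converted.zarr"  # hypothetical path

# Read only the Provenance group to see which echopype version wrote the file.
prov = xr.open_dataset(store, group="Provenance", engine="zarr")
conv_version = Version(str(prov.attrs.get("conversion_software_version", "0.0.0")))

if conv_version < Version("0.9.2"):
    # Legacy layout: open groups individually, fix the conflicting coordinate,
    # then rebuild the tree (as in the earlier sketch).
    groups = xr.open_groups(store, engine="zarr")
    nmea = groups.get("/Platform/NMEA")
    if nmea is not None and "time1" in nmea.coords:
        groups["/Platform/NMEA"] = nmea.rename({"time1": "nmea_time"})
    tree = xr.DataTree.from_dict(groups)
else:
    tree = xr.open_datatree(store, engine="zarr")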

@oftfrfbf
Contributor

oftfrfbf commented Feb 5, 2025

Submitted a pull request here: #1447

but actually, this is probably written in zmetadata?

Yeah, technically it could be read from the .zmetadata JSON via some jq-like library pretty easily. I am not sure there is an analogous approach for netCDF files though.
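
For reference, a rough sketch of inspecting the consolidated metadata directly (zarr v2 layout; the path and the Provenance key are assumptions, and whether the writing xarray version is recorded there depends on what attributes were actually stored):

import json
from pathlib import Path

store = Path("old_raw_converted.zarr")  # hypothetical path

# Zarr v2 consolidated metadata keeps each group's .zattrs under the "metadata" key.
zmeta = json.loads((store / ".zmetadata").read_text())
prov_attrs = zmeta["metadata"].get("Provenance/.zattrs", {})
print(prov_attrs.get("conversion_software_version"))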
