-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backward compatibility of raw-converted dataset after xarray.DataTree
migration
#1420
Comments
I will be taking a look at this issue. I am currently getting up to speed with xarray.DataTree and its use throughout the project. Will be starting out by getting caught up on the tickets leading up to this and then will circle back here and gather more requirements. I am thinking it would be nice to know what domain of versions of echopype will need to be upgradable, e.g. v7.0 to v9.0 or v8.0 to v9.0, and if there there is any specific test data you would like me to work with. |
Thanks @oftfrfbf !
I don't know exactly who has large volumes of data converted using previous echopype versions, but my guess would be that the current format (the one v0.9.0 generates) is the most widely used. To simplify things and also so that we don't get swamped by data versioning just yet, I think we can focus on something that can take v0.9.X-generated data (which uses the older I'll take a look at when the last format change was so that we can put an accurate statement in the documentation. |
An update, I am making progress here: https://github.com/oftfrfbf/echopype/tree/EP-1420 with opening the data via open_groups and converting with DataTree.from_dict(). I have a couple different test resources across major versions of echopype that I am trying to ensure work properly. It has taken some time to get things set up but now I'm troubleshooting the changes needed to the time1/nmea_time naming. I'll continue to update here with progress. "Create a mechanism for users to optionally convert their data to the new version" ...I would presume that once the capability to open the older v0.9.0 DataTree files is enabled, users can just open the file and save same as normal w >=v0.9.2 — as long as it overwrites the old zarr store it will be in the proper convention?! |
Yes that's what I was thinking! |
Hey @oftfrfbf - Just wanted to let you know that @ctuguinay and I resolved quite a few bugs/warnings (though still pinned zarr<3 and netcdf<1.7.1), so please pull from Do you think you may be able to finish this some time this week? With all the dependency lags, we really need to release asap, because at this moment a fresh environment with echopype will error out at import (see #1432) and produce various errors, which is very serious, but we can't release without the |
Hi @leewujung, sorry this is taking so long (political events in the federal & university world aren't helping with all the distractions). I have put together some test files I created by deploying echopype to varying test environments and generating zarr stores from a test file (only focused on >=v0.9.0): ![]() And I have this code so far (sorry it's messy): echodata.py in the "from_file" method: if XARRAY_ENGINE_MAP[suffix] == "zarr": # Testing for legacy datatree
# tree = open_datatree( # open_datatree is for new datatree
tree = open_groups( # open_groups is for older obsolete data
converted_raw_path,
engine=XARRAY_ENGINE_MAP[suffix],
**echodata.open_kwargs,
) # open_groups has '/', '/Environment', '/Platform', etc
# newer 0.9.1 only has '/' ...older has '/'+'/Platform'+a bunch of others
if (
"/Platform/NMEA" in tree
): # ......for v0.9.0.zarr there is only a root directory (when calibrate was done)
platform_nmea = tree["/Platform/NMEA"]
if "time1" in platform_nmea.coords:
platform_nmea = platform_nmea.rename({"time1": "time_nmea"})
print(platform_nmea)
tree["/Platform/NMEA"] = platform_nmea
tree = xr.DataTree.from_dict(tree) # before this rewrite 'time1'
tree.name = "root"
print(tree)
elif "/" in tree: # this is the calibrate.compute_Sv() situation that fails
tree = xr.DataTree.from_dict(tree)
tree.name = "root"
else:
tree = open_datatree(
converted_raw_path,
engine=XARRAY_ENGINE_MAP[suffix],
**echodata.open_kwargs,
) # open_groups has '/', '/Environment', '/Platform', etc
tree.name = "root" along with a test in test_echodata.py. Hopefully "time_nmea" is the correct variable name for the "time1" data? I can open converted zarr files from v0.9.2 and greater without problems. I can also open converted zarr files from v0.9.0 and v0.9.1 and do the datatree conversion and everything ends up as expected. That said, if the v0.9.0-or-v0.9.1 echodata object had a method applied to it before saving, e.g. "calibrate.compute_Sv," then the structure is a little different and I am trying to figure out the proper procedure to convert the data in that situation. I think it's something simple like redefining the root node and I am getting closer but have yet to figure it out properly completely. I am also a little unsure of how to proceed with opening the file. Right now I try to open any zarr files with "open_groups" first and if that doesn't return the expected legacy formatting then I proceed to open it with "open_datatree" ...that seems really inefficient to me, opening the file potentially twice in a row? Without adding additional parameters to control the flow, I am not sure how to properly determine if the file is a legacy format without first opening it first. It just doesn't seem like the right way to do things. I can devote the rest of the week to getting this running properly. I am wondering if it looks like I am headed in the right direction though? Any thoughts? |
Thanks @oftfrfbf for the updates! Sorry for the impacts of politics...
@ctuguinay and I had a discussion on this and settled on
I am a little confused here - I thought this has more to do with if the data could be opened with aligned inherited coordinates (see here)? but I could be wrong. I am also puzzled as to why performing a method on an Echodata object would change its structure. Unless the method modifies anything in place in memory? (open_groups opens with read-only access) What are the things you saw that got changed?
I agree that this could be inefficient. A few thoughts below:
Let me know If it's helpful to hop on a quick call. This is so messy! |
Thanks for the help @leewujung,
Sorry, I think I'm making this more confusing than it needs to be. I guess the distinction here is between opening and saving a echodata object versus doing something like calibrate.compute_Sv(echodata) and saving it. The former saves the echodata object, the later ends up as an Xarray DataSet — my confusion here was if the data, "ds_Sv" saved in a Zarr store for example, would need to be opened with echopype? Thinking this through again, my intuition is telling me no now, users would just open it with Zarr or Xarray. ![]() ...
Yes, I am not seeing any noticable overhead to open the file twice with the method I gave previously. I think the only complications that could be associated with that would be if a chunking scheme is undefined by the Zarr engine when opening the file with dask in the background. I don't think this is something we need to take into consideration. I will opt to keep things as they are then.
Nothing obvious is jumping out to me in the .zmetadata as I look through it. I think that I have what I need to move forward. I'll pull your latest changes from the main branch and put things together in a pull request. |
Yep, results of
If there's slowdown it will likely happen with larger files, like with a combined EchoData object. Most of the EK files are pretty small so the difference may not be noticeable. This is likely ok as we go through the transition. If someone has very large files and can't handle it, hopefully we'll hear about it. |
The only other alternative I can really think of is to parse the semantic version for the provenance "conversion_software_version" in the root .zmetadata for the semantic version and use that as a conditional where for <0.9.2 use "open_groups" else use "open_datatree". |
I was thinking about this too yesterday -- in the |
Submitted a pull request here #1447
Yeah, technically it could be read from the .zmetadata json via some jq library pretty easily. I am not sure there is any analog approach for netcdf files though. |
Context
Part of the
xarray-contrib/datatree
todatatree to xarray.DataTree
migration requires strict inheritance of the coordinates:https://github.com/pydata/xarray/blob/main/DATATREE_MIGRATION_GUIDE.md#data-model-changes
There should just be 1 case of this in the current echopype-adapted SONAR-netCDF4 structure, under
Platform/NMEA
group, since thePlatform/time1
coordinate is a subset of thePlatform/NMEA/time1
coordinate -- Thanks @ctuguinay for finding this out!One workaround is to remove the indexes associated of the coordinate, but that may introduce confusion down the road, so we should probably go with renaming the
Platform/NMEA/time1
coordinate (in the work in #1419).Issue
However, once this is fixed, we have a larger problem that the previous raw-converted zarr dataset cannot be opened by the new
xarray.open_datatree
functions.Tasks
xarray.open_groups
to circumvent the inheritance problem for users to open previously stored raw-converted dataThe text was updated successfully, but these errors were encountered: