Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate h5py soft and hard linked datasets with an optional kwarg #463

Merged
merged 14 commits into from
Jul 1, 2024

Conversation

ljwoods2
Copy link
Contributor

@ljwoods2 ljwoods2 commented Jun 16, 2024

Fixes #459

Linked groups can't be translated as easily as linked datasets- the h5py method visititems_links is only called once for
each link. This means that if you have this hdf5 layout for example:

/
 └── particles
     └── trajectory
         ├── box
         │   └── edges
         │       ├── step (50,) int32
         │       ├── time (50,) float32
         │       └── value (50, 3, 3) float32
         └── position
             └── value (50, 13406, 3) float32

And you try to create this hard link and translate it:

f["particles/trajectory2"] = f["particles/trajectory"]

Then visititems_links will be called on "box/edges/step", "time", and "value" during the traversal of "particles/trajectory" and therefore can't be traversed again during the traversal of "particles/trajectory2", resulting in an empty translated group. This behavior makes sense since it avoids the problem of circularly-linked groups, but it makes translating groups with h5py-builtin traversal methods difficult. If zarr-python 3.0 implements links, visititems and visititems_links may have to be replaced with custom directory traversal code as a result

This PR only allows translating linked datasets by traversing the h5py directory twice and using require_dataset instead of create_dataset to avoid duplicating non-linked groups and datasets.

@martindurant
Copy link
Member

I'll look into the test failure, which looks unrelated. Maybe something about numpy 2, which just came out.

@martindurant
Copy link
Member

(update: I have fixed one test failure so far, so one more to go)

@martindurant
Copy link
Member

If you sync your branch, things should pass now.

@hmacdope
Copy link

@martindurant thanks for helping @ljwoods2 with this! This is part of his GSOC 2024 work with MDAnalysis, which I am co-supervising. With that hat on would you be able to approve the workflows here by any chance? Getting this merged would help us a lot moving forward. Let me know if I can help in any way. 😄

@martindurant
Copy link
Member

Sorry, new commits by themselves don't send a notification. Would you like me to add a label to this PR for gsoc?

kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Show resolved Hide resolved
@ljwoods2
Copy link
Contributor Author

I switched out the string version compatibility with a direct check of if the visititems_links method is present- not sure if this is the most pythonic way to do it, let me know what you think. Also integrated your other suggestions

kerchunk/tests/test_hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
kerchunk/hdf.py Outdated Show resolved Hide resolved
@martindurant martindurant merged commit 061bf98 into fsspec:main Jul 1, 2024
5 checks passed
@hmacdope
Copy link

hmacdope commented Jul 8, 2024

Thanks a lot @martindurant! Appreciate your help here 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kerchunk doesn't translate HDF5 hard links
3 participants