[Bug]: When appending to file with newer schema, the older schema is not deleted #1098
Comments
@stephprince I assigned you this because you'll be working with namespaces and validation soon, and this is related.
This assumes that the schema (not just the API) is compatible. E.g., let's say you stored an
Under that example, different objects of the NWB file are valid only under different schemas. The file could fail validation (which validates all objects under one schema) under both the new and the old schema. Allowing this would open a big can of edge-case worms. I would rather be more restrictive and say that saving an NWB file using the newest schema is allowed only if all objects are valid in the newest schema. The new schema may not be compatible with the old schema, though I can't think of a great example of that right now. We could validate before write to check this, which is something we were thinking of doing anyway.
So that means validating before updating a file to the newer schema?
If a file with an older schema is opened in pynwb/hdmf that has loaded a newer schema, we should probably validate the file against the newer schema in case there are weird discrepancies that are not accounted for. During the write process, we should validate the builders against the newer schema and not allow the user to proceed if something is invalid (this usually means the API is buggy because it allowed the user to create non-conforming builders).
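As a rough illustration of the "validate before write" idea, here is a minimal sketch using pynwb's `validate()` utility. The file name is hypothetical, and the exact signature and return type of `validate()` vary between pynwb versions, so treat this as a sketch rather than the planned implementation:

```python
from pynwb import NWBHDF5IO, validate

# Hypothetical file path; load_namespaces=True also reads the namespaces
# cached in the file, while pynwb's bundled (newer) schema stays loaded.
with NWBHDF5IO("session.nwb", mode="r", load_namespaces=True) as io:
    # Validate the file contents against the loaded (newer) schema before
    # deciding whether an append/upgrade with that schema is safe.
    errors = validate(io=io)
    if errors:
        raise ValueError(f"File is not valid under the loaded schema: {errors}")
```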
@rly If I understand this correctly, the proposal is effectively to create a migration path to upgrade a file to a new schema.
@oruebel Effectively, yes. Here is a concrete use case: in NWB 2.1.0, "experimenter" changed from a scalar string to a 1D array of strings. Before we can delete the pre-2.1.0 schema from the cache, we should migrate the no-longer-valid scalar "experimenter" value to a valid 1D array in memory and write that change to the file on write.
We can use the
This may not be trivial. Changing the data type, shape, etc. of datasets and attributes is not possible in HDF5. I think one would need to first remove the existing dataset (or attribute) in that case and create a new one with the same name.
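To make the delete-and-recreate step concrete, here is a minimal h5py sketch for the "experimenter" example above. The file name is hypothetical, and this naive version drops any references to the old dataset, which is the problem discussed below:

```python
import h5py

# Sketch: migrate a scalar /general/experimenter dataset to the 1D array
# of strings required by the newer schema. HDF5 cannot change the shape
# or dtype of an existing dataset in place, so delete and recreate it.
with h5py.File("session.nwb", "r+") as f:  # hypothetical file name
    old_value = f["/general/experimenter"][()]
    if isinstance(old_value, bytes):
        old_value = old_value.decode("utf-8")
    del f["/general/experimenter"]
    f.create_dataset(
        "/general/experimenter",
        data=[old_value],
        dtype=h5py.special_dtype(vlen=str),
    )
```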
Related to #1018 (comment): why not separate the actions of appending and migrating? It seems like the real issue is that it's possible (and in fact impossible not to, as far as I can tell) to use the wrong version of a schema with a file. If, when opening a file for writing, the schema version that's embedded with the file is the one that's loaded by pynwb/hdmf, then the problem disappears. Then, if we wanted to do migrations, it would be an explicit rather than accidental process - a file uses the schema that it explicitly packages with it until the user explicitly acts to change that, e.g., via a migration system. Implicitly writing data with a newer schema by accident seems like highly undesirable behavior, as does implicit/automatic migration.
Revisited this issue because handling multiple schema versions simultaneously came up for me again, and I realized where the discussion last left off.
If it's helpful and this is still something that would be desirable to do: I had to do this recently to make truncated sample files for testing, and the hardest part was preserving the references, because when you remove and recreate a dataset, all the references to it are destroyed. This is a relatively cumbersome function since it's intended to be a CLI command (once I write the CLI module), so sorry for the
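For readers landing here later, a stripped-down sketch of that reference-preserving replacement, assuming plain h5py; it only handles object references stored in attributes (not reference-typed datasets or region references), and the function name is made up for illustration:

```python
import h5py

def replace_dataset_preserving_attr_refs(f, path, new_data, **dset_kwargs):
    """Delete f[path], recreate it with new_data, and re-point attribute
    references that targeted it. Sketch only."""
    # 1. Find attributes anywhere in the file that hold a reference to `path`.
    holders = []  # (object path, attribute name)

    def _collect(name, obj):
        for attr_name, val in obj.attrs.items():
            if isinstance(val, h5py.Reference) and val and f[val].name == path:
                holders.append((obj.name, attr_name))

    _collect("/", f["/"])      # root group attributes
    f.visititems(_collect)     # all groups and datasets below root

    # 2. Remove the old dataset and create the replacement at the same path.
    del f[path]
    new_dset = f.create_dataset(path, data=new_data, **dset_kwargs)

    # 3. Rewrite the collected references to point at the new dataset.
    for obj_path, attr_name in holders:
        f[obj_path].attrs[attr_name] = new_dset.ref
```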
What happened?
See flatironinstitute/neurosift#157
Currently, a file may contain multiple versions of a schema, e.g., "core" 1.5.0 and "core" 1.6.0 after a user appends to a file that has an older version of a namespace cached than the one currently loaded by pynwb/hdmf. I don't see the value of that, given that individual neurodata type objects are attached to a namespace and not a namespace & version. It adds confusion.
I think that when writing the file with schema caching enabled, the older version of a namespace should be deleted and the newer one added.
This raises the use case where the existing file has a cached schema and the user chooses not to cache the new schema, but the data are written with the new schema. We should not allow that. Is there a use case where we don't want to cache the schema on write? Should we just disallow that to prevent edge cases like this during append/modification? I think so.
Related: NeurodataWithoutBorders/pynwb#967, which mentions possible performance slowdowns, but they seem minor. I think the value of caching the schema trumps that, and we can optimize the writing of the schema separately - it is just a group with a few string datasets.
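For context, cached specs live under `/specifications/<namespace>/<version>` in the HDF5 file, so the cleanup proposed above amounts to something like the h5py sketch below. The file name is hypothetical, and a real implementation would live in hdmf's write path and compare versions properly rather than sorting strings:

```python
import h5py

# Sketch of the proposed cleanup: keep only the newest cached version of
# each namespace under /specifications after a successful write.
with h5py.File("session.nwb", "r+") as f:  # hypothetical file name
    specs = f["/specifications"]
    for ns_name in specs:
        versions = sorted(specs[ns_name].keys())  # naive lexicographic sort
        for old_version in versions[:-1]:         # drop everything but newest
            del specs[ns_name][old_version]
```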
Steps to Reproduce
Traceback
No response
Operating System
macOS
Python Executable
Conda
Python Version
3.12
Package Versions
No response