Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash get_object_state, use dir for native types, fix __slots__ #208

Merged
merged 11 commits into from
Oct 11, 2024

Conversation

albertz
Copy link
Member

@albertz albertz commented Oct 8, 2024

Fix #207

See #207 for some discussion.

Note that this slightly changes the behavior, but only for cases which were basically broken before.

E.g., for functools.partial, before this change, none of the relevant args (func, args, keywords) were ever part of the hash. So this change should not really break anything.

The same is also true for cases where there was both a __dict__ and __slots__. With Python <=3.10, it would have used only the __dict__, and very likely, that was just empty, and does not contain any of the relevant things (which are very likely all in the __slots__), so it was totally broken and basically unusable. With Python >=3.11, there is a default __getstate__ now, which would already do the correct thing, at least for "normal" (non-native) types, i.e. merge __dict__ and __slots__, i.e. our existing get_object_state was correct for those cases. With this change, Python <=3.10 is also correct now.

In any case, we check through dir now, and we check for member descriptors, which should also cover such native types (like functools.partial).

@albertz albertz marked this pull request as draft October 8, 2024 10:35
@albertz albertz marked this pull request as ready for review October 8, 2024 18:03
@NeoLegends
Copy link
Contributor

NeoLegends commented Oct 9, 2024

I'm running apptek_asr CI w/ the latest commit here to see if AppTek was affected by the bug.

@albertz
Copy link
Member Author

albertz commented Oct 10, 2024

Btw, I just noticed, our extract_paths has the same problem. I actually wonder, why does it not use get_object_state, but instead its own logic? Should I also fix this in this PR here, or make a separate PR for this (I think I would just use get_object_state there)?

-> @curufinwe said separate PR.

@NeoLegends
Copy link
Contributor

NeoLegends commented Oct 10, 2024

So, it seems like w/ the new implementation pathlib.Path hashes differently. Old sis gave:

>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{}

New sis gives:

>>> p = pathlib.Path("/home/mgunz/src/apptek_asr/[...]/test-hash.py")
>>> from sisyphus.hash import get_object_state
>>> get_object_state(p)
{'_drv': '', '_parts': ['/', 'home', 'mgunz', 'src', 'apptek_asr', ..., 'test-hash.py'], '_root': '/'}

The reason for this is that if you instantiate a pathlib.Path, on unix systems you get

class PosixPath(Path, PurePosixPath):
    """Path subclass for non-Windows systems.

    On a POSIX system, instantiating a Path should return this object.
    """
    __slots__ = ()

Notice the empty slots. Before this PR, sisyphus would therefore not take into account any properties for the hash. get_object_state(pathlib.Path) would give you (PosixPath, (dict)). Now, w/ the changes you instead get PosixPath, (dict, (tuple, (str, '_cached_cparts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_drv'), (str, '')), (tuple, (str, '_parts'), (list, (str, '/'), (str, 'nas'), (str, 'data'), (str, 'speech'), ...)), (tuple, (str, '_root'), (str, '/')))).

All in all the previous behavior was extremely broken, so I think we should merge this as-is and accept the breakage.

@albertz
Copy link
Member Author

albertz commented Oct 10, 2024

Ah, PosixPath seems to behave somewhat bad in both cases though (although some state is obviously better than no state at all). The state you get there with the new code doesn't really seem to be so stable across Python versions.

But I also wonder, the __slots__ is always empty? That means, all the state should be in __dict__? So then the old code should have worked already, or not?
__slots__ is not empty. It derives from PurePath, and that defines:

    __slots__ = (
        '_drv', '_root', '_parts',
        '_str', '_hash', '_pparts', '_cached_cparts',
    )

Btw, also please say what Python version you used. For the new code, the Python version should not make a difference (well, despite if the object changed its internal representation - that might also be the case for PosixPath, I don't know). For the old code, it will also be different depending on Python version. With Python >=3.11, I think the state was not empty for PosixPath. Maybe check that as well.

In any case, maybe it makes sense to treat PosixPath as a special case anyway, and just store its path as state? (But that would be a separate PR then.)

Also, I wonder, is that correct that you have PosixPath as argument here? Maybe you actually want to have a Sisyphus Path instead?

@curufinwe
Copy link
Collaborator

In general, yes. A sisyphus Path would be preferable, but we should make sure that sisyphus also works correctly with PosixPath. I would not mind handling it separately, but then we should do it in this PR to avoid having another commit than changes hashes.

@michelwi
Copy link
Contributor

The python version was 3.10 in the case in question.

@albertz
Copy link
Member Author

albertz commented Oct 10, 2024

Ok, I added special case for pathlib.PurePath.

@albertz albertz merged commit 9e597db into master Oct 11, 2024
3 checks passed
@albertz albertz deleted the albert-fix-partial-state-207 branch October 11, 2024 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hash (get_object_state) of functools.partial is wrong
4 participants