Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type hints to db classes #1919

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

stevenyoungs
Copy link
Contributor

@stevenyoungs stevenyoungs commented Feb 3, 2025

Add type hints to the various db classes.

This builds upon, and requires, a minimal set of code changes introduced as PR #1934. (These changes were originally included in this PR)

All type hints are in the final commit, which only contains type hints.

@stevenyoungs
Copy link
Contributor Author

I've created this PR whilst it is still a work in progress to get some input on the impact.
DbReadBase.get_person_from_handle has been annotated to return Person | None
Why None, because ProxyDbBase.get_person_from_handle can return None.
If None is not included, mypy gives the error
gramps/gen/proxy/proxybase.py:516: error: Return type "Person | None" of "get_person_from_handle" incompatible with return type "Person" in supertype "DbReadBase" [override]

Having done that, code such as this

gramps/gramps/gen/db/base.py

Lines 1959 to 1962 in 1280aa4

person = self.get_person_from_handle(person_handle)
family = self.get_family_from_handle(family_handle)
person.remove_family_handle(family_handle)

now gets (correctly) flagged as an error because it just uses person even if it is None

gramps/gen/db/base.py:2023: error: Item "None" of "Person | None" has no attribute "remove_family_handle" [union-attr]

The solution I've taken so far in this PR is to amend the code to handle a return of None correctly. In this situation to protect the code with if person and family:

Before I go any further, is that the correct approach or is there a better way?

@dsblank
Copy link
Member

dsblank commented Feb 4, 2025

I thought about alternative approaches where None is never a valid return value.

One way would return, say, a RedactedPerson() rather than None. But I don't think that won't work very well for the intended purpose. For example, if two people have an illegitimate child marked "private" then you wouldn't want to see any trace of them in any output.

Another approach would be to remove the handle from the reference object. It might be possible to pre-process the entire DB, caching the proper references and removing anything pointing to a private object, and replacing sensitive values, like "Living". This would probably be faster and less error prone in the end.

@kulath
Copy link
Member

kulath commented Feb 4, 2025

(Note, I haven't tried to understand the logic in this particular case, so my comment is just on general principles).

If the code doesn't currently try to deal with a return of None, then presumably the only way None can occur is through a database corruption (or something of that sort). If corruption has occurred, the it should be flagged as soon as possible. I think this is the point responded to in #1913 (comment), where AIUI the database does a check and raises HandleError , if it doesn't then the earliest point to raise the error would be in the database, which should be done.

I don't think the code should be guarded with if person and family, because that isn't really the correct logic if None should never occur, (and could conceivably cause problems further downstream).

The problem seems (to me) to be that a Proxy could correctly return None, but outside a proxy thay could not happen (except for a database fault of a broken link). So would the correct thing be to have a different method for the call in a proxy (or an additional parameter) to say that None was a legitimate return, and then MyPy could do the correct check. - I have no idea whether this would solve the problem or whether it is possible given that the problem is with inheritance, or whether this would involve too much duplication of code!

@kulath
Copy link
Member

kulath commented Feb 4, 2025

How about using Assert person is not None?

Am I right in thinking that Assert is only executed if Debug is on. In which case, this would not impose a performance penalty. However, it would provide documentation of what was expected, and presumably would allow the static type checking to work correctly. If is is only on with debug, then the CI test modules should all be run with debug on (I don't know whether this is done or not).

@stevenyoungs stevenyoungs force-pushed the typing-db branch 2 times, most recently from e72da20 to f0a828a Compare February 4, 2025 20:29
@stevenyoungs
Copy link
Contributor Author

Another approach would be to remove the handle from the reference object. It might be possible to pre-process the entire DB, caching the proper references and removing anything pointing to a private object, and replacing sensitive values, like "Living". This would probably be faster and less error prone in the end.

I was trying to restrict this PR to just adding type hints. I've not quite succeeded but I'm reluctant to make larger scale code changes if I can avoid it.

@stevenyoungs
Copy link
Contributor Author

The problem seems (to me) to be that a Proxy could correctly return None, but outside a proxy thay could not happen (except for a database fault of a broken link).

As I understand it a proxy db is substitutable for the real db. So all higher level (business logic if you like) code should be written to correctly handle a None return. If we add an alternative set of methods, which can return None, won't we still have to support a None return type in much of the code?

"""
Return an iterator over database handles, one handle for each Person in
the database.
"""
return filter(self.include_person, self.db.iter_person_handles())
return (_ for _ in filter(self.include_person, self.db.iter_person_handles()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        return (_ for _ in filter(self.include_person, self.db.iter_person_handles()))

Flagging this change for comment.
Method: def iter_person_handles(self) -> Generator[PersonHandle]

This code previously returned a List. The equivalent method in DbGeneric returns a Generator[PersonHandle].
I've standardised on returning a generator. Is this the right choice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@@ -78,10 +105,12 @@ class DBAPI(DbGeneric):
Database backends class for DB-API 2.0 databases
"""

dbapi: Any # might be better to have a specific ConnectionBase class
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    dbapi: Any  # might be better to have a specific ConnectionBase class

Flagging this line for comment.
It's not correct to type hint dbapi as Any but I could not find an appropriate superclass to use.

@stevenyoungs
Copy link
Contributor Author

Cursor functions such as get_person_cursor don't have a consistent return type so I'm unsure how to correctly type.
Suggestions welcome

Class Return Type
DbReadBase
DummyDb []
DbGeneric Cursor
ProxyDbBase ProxyCursor

@kulath
Copy link
Member

kulath commented Feb 5, 2025

Having done that, code such as this

gramps/gramps/gen/db/base.py

Lines 1959 to 1962 in 1280aa4

person = self.get_person_from_handle(person_handle)
family = self.get_family_from_handle(family_handle)
person.remove_family_handle(family_handle)

now gets (correctly) flagged as an error because it just uses person even if it is None
gramps/gen/db/base.py:2023: error: Item "None" of "Person | None" has no attribute "remove_family_handle" [union-attr]

The solution I've taken so far in this PR is to amend the code to handle a return of None correctly. In this situation to protect the code with if person and family:

Before I go any further, is that the correct approach or is there a better way?

I don't see how protecting the code locally with if family etc and then 'carrying on' can be the right thing. Quite apart from the fact that it is a change to the logic (which you are reasonably unwilling to do), you just don't know what the rest of business logic will be doing and whether it will be consistent later on.

When Nick wrote that

The problem in the past was that the base database layer also returned None in the case of broken links. As a result, real errors were missed. This has long been fixed and the database now raises a HandleError in such circumstances.

I presumed that he meant that anywhere that None was returned was already guarded by testing for None and doing the right thing. When you said

As I understand it a proxy db is substitutable for the real db. So all higher level (business logic if you like) code should be written to correctly handle a None return. If we add an alternative set of methods, which can return None, won't we still have to support a None return type in much of the code?

From what Nick had said, I imagined that the code is effectively divided into code that can be called with a proxy db, and code that cannot be called with a proxy db. And I imagined that all the code that could be called from a proxy db already dealt with the possibility that None might be returned. Hence I would assume that the code you quote simply cannot be called with a proxy db, and hence it can safely assume that None is not possible.

So, I think that either you should have a statement like
if family is None: raise <some exception> or
assert family is not None

@dsblank
Copy link
Member

dsblank commented Feb 5, 2025

I imagined that the code is effectively divided into code that can be called with a proxy db, and code that cannot be called with a proxy db

This is incorrect. Much code can be called with a limiting proxy (Private, Living) in place. This most likely happens in reports, filters, some utils (like alive.py), exports, and tools.

@dsblank
Copy link
Member

dsblank commented Feb 5, 2025

@stevenyoungs I think your approach is sound. We can work on proxies and their ability to return None in another initiative.

Comment on lines 1215 to 1233
# Derived fields
if table == "Person":
given_name, surname = self._get_person_data(obj)
given_name, surname = self._get_person_data(cast(Person, obj))
sets.append("given_name = ?")
values.append(given_name)
sets.append("surname = ?")
values.append(surname)
if table == "Place":
handle = self._get_place_data(obj)
handle = self._get_place_data(cast(Place, obj))
sets.append("enclosed_by = ?")
values.append(handle)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better if this were written as
if type(obj) is Person
etc.
(If I am right that this works the same)

I appreciate that this would be changing existing code, but surely no more than adding cast. The existing code is not very clear - accessing the type through table = obj.__class__.__name__ is complex and obscure, the conversion to name is not really necessary, and table is not a good name for the variable.

Testing the type in the if statement would (AIUI) automatically provide the static type checking, and if anyone wanted to add more code in the future, that wouldn't need to be type confirmed by something like a cast.

@kulath
Copy link
Member

kulath commented Feb 5, 2025

@stevenyoungs I think your approach is sound. We can work on proxies and their ability to return None in another initiative.

Well, if you are going to work on the ability of proxies to return None at another time, don't do the fix of adding the if statement now, do something like an assert for now, mark it with a fixme, and then the consequences of testing for None can be properly evaluated at some later time. Adding the if statement now makes it look as though all the consequences (on the calling routines) have been evaluated now. (I'm thinking about other places where something similar occurs, not just this exact piece of code, and the need for a common approach).

"""
Remove a person as either the father or mother of a family,
deleting the family if it becomes empty; trans is compulsory.
"""
person = self.get_person_from_handle(person_handle)
family = self.get_family_from_handle(family_handle)
if person and family:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put code changes in a separate PR. Keep this PR for adding type hints only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The code changes will need to be applied first. The type hints on their own will not pass mypy; the code changes are the minimal set strictly required for mypy to pass.

@dsblank
Copy link
Member

dsblank commented Feb 5, 2025

As @stevenyoungs mentioned, he is only adding type hints in this PR (and adjusting code just enough to pass mypytests).

If someone reports a crash because some general code is failing because it isn't properly handling proxies that return None, the immediate fix is to add code that checks the return value. A complete rethink of proxies is going to take some time and energy.

@Nick-Hall Nick-Hall added this to the v6.1 milestone Feb 5, 2025
@kulath
Copy link
Member

kulath commented Feb 6, 2025

If someone reports a crash because some general code is failing because it isn't properly handling proxies that return None, the immediate fix is to add code that checks the return value.

Maybe that 'immediate fix' is right, and maybe it isn't.

I'm sorry to point out something that I am sure you already know! I'm sure you don't mean only just blindly patch the code where the crash manifests itself.

Not handling proxies that return None is a bug, and you don't fix bugs by patching the symptom. You reproduce the fault, and then test whether the patch you propose fixes everything. You need to make sure that not only the place where the crash first occurs is OK, but also any other calling or called code also behaves correctly.

That's why I don't think you should just add an if statement to skip part of the code. That may well turn out to be the necessary and sufficient solution to the problem, but it will probably take some time and energy (and perhaps another initiative) to determine it.

I'm really excited about the static typing and especially how it is forcing clarification of the code, as mentioned by several of the comments in this PR. However, if the code was going to crash before, then it should still crash after the type checking has been added. Hence my suggestion of if family is None: raise <some exception> or assert family is not None (and add a fixme) to still crash, but to satisfy the static type checking.

@stevenyoungs
Copy link
Contributor Author

Hence my suggestion of if family is None: raise <some exception> or assert family is not None (and add a fixme) to still crash, but to satisfy the static type checking.

The problem is that the type hinted code does not pass type checking unless you modify the code to handle a None return.

@stevenyoungs
Copy link
Contributor Author

As suggested, I've split the code changes out into a separate PR #1934
This PR is now the final commit, which contains type hints only.

@Nick-Hall
Copy link
Member

Nick-Hall commented Feb 6, 2025

I think that allowing None as a return value from get_person_from_handle is a mistake. The type checking is highlighting the fact that we have a design issue, which we already know about, but the solution is to prevent proxies from returning None, not fixing code that isn't broken.

A better approach would be to make the proxies return empty objects instead. Then a check like if person is None: would become if person.is_empty():. If the check was omitted by mistake then we would just see an object with all details redacted.

This doesn't actually fix the problem with proxies leaking information, but that could be dealt with at a later date.

@stevenyoungs
Copy link
Contributor Author

if person.is_empty():

Is there a heuristic that can be used to check if a PrimaryObject is empty?
no handle and no gramps_id perhaps?

@Nick-Hall
Copy link
Member

Nick-Hall commented Feb 6, 2025

An empty object would have its handle set to None. An object from the database would always have a handle.

if person.handle is None: would work equally as well as if person.is_empty(): but creating an is_empty method might be neater and easier to understand.

@kulath
Copy link
Member

kulath commented Feb 7, 2025

I think that allowing None as a return value from get_person_from_handle is a mistake. The type checking is highlighting the fact that we have a design issue, which we already know about,

Agreed!

but the solution is to prevent proxies from returning None, not fixing code that isn't broken.

A better approach would be to make the proxies return empty objects instead. Then a check like if person is None: would become if person.is_empty():. If the check was omitted by mistake then we would just see an object with all details redacted.

This doesn't actually fix the problem with proxies leaking information, but that could be dealt with at a later date.

Doesn't this just move the crash to a different place? If you return an empty object, then isn't it possible that the next piece of code will try to use the non-existent handle and consequently crash? I suppose the question we might ask is "does this handle point to an accessible object?" before trying to call get_person_from_handle.

The problem is that the type hinted code does not pass type checking unless you modify the code to handle a None return.

If you modify the code to add an assert, will this pass type checking?

My theory is that such a modification would not alter the behaviour of the code (unless debug was switched on), so is the minimal change without extensive modification to fix the underlying problem.

@dsblank
Copy link
Member

dsblank commented Feb 7, 2025

@Nick-Hall said:

A better approach would be to make the proxies return empty objects instead.

Maybe (I mentioned that above). An even better approach would be that we would never have a reference handle that pointed to a an inaccessible object.

@kulath said:

I suppose the question we might ask is "does this handle point to an accessible object?" before trying to call get_person_from_handle.

I don't think we want to surround every db.get_object(handle) with code like that any more than we want to check for the results being None.

BUT, we could rethink the proxies to remove all handles to inaccessible objects. Let's consider that before changing every get method.

@stevenyoungs stevenyoungs marked this pull request as ready for review February 8, 2025 21:35
@stevenyoungs
Copy link
Contributor Author

I think this is ready for review.

This PR is to review the final commit only "Add type hints to db methods". It only adds type hints.
Earlier commits, which make code changes, should be reviewed in #1934

@stevenyoungs stevenyoungs changed the title WIP: Add type hints to db classes Add type hints to db classes Feb 8, 2025
…check for handle existence. As a result it can get the object directly rather then convert the raw data into an object
In future, this allows the return type hint of <object> | None for consistency with other db implementations
@stevenyoungs
Copy link
Contributor Author

stevenyoungs commented Feb 9, 2025

The following PRs are pre-requisites: #1934, #1945, #1948, #1949

changes from #1934 are included here. mypy will fail without #1945, #1948, #1949

@stevenyoungs stevenyoungs force-pushed the typing-db branch 2 times, most recently from cf983a9 to 25f76d1 Compare February 10, 2025 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants