Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to properly handle connection issues and reconnecting? (request for comments) #1660

Open
cerlestes opened this issue Jun 18, 2024 · 9 comments

Comments

@cerlestes
Copy link

cerlestes commented Jun 18, 2024

Hello everyone!

First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.

The topic is handling connection issues and reconnecting properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it's reloading the server, we're reconnecting the client from our application once we try to interact with a node and it fails (we catch the UaError and simply try connecting up to a few times). This was fine until subscriptions came into play. With subscriptions, I'm really having a hard time finding the proper way to detect issues, reconnect and restart the subscriptions.

I've found the Client._monitor_server_loop() method, which is started as a task into Client._monitor_server_task. Once the connection dies, it'll inform the subscriptions of the BadShutdown. This seems to be about the only way to be informed about a connection issue other than emulating that behaviour externally to the client, polling and catching errors when they are raised. Another method of detecting connection issues is the Client.check_connection() method. But again, this method must be polled from the application external to the client.

I think ideally the client itself should provide a mechanism to allow applications to react to connection issues and states in general, i.e. callback when the client lost the connection. On top of that, it should then implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect upon losing connection, including restoring any subscriptions.

My current proposal would be the following:

  • Add three asyncio.Event instances Client.connected, Client.disconnected, Client.failed. These events are set() when the respective connection state is reached and clear()-ed when the respectice state is left. This would allow application code to simply await client.connected.wait() before each interaction with the client. It would also allow to run error handler tasks once the connection fails with await client.failed.wait().
  • Maybe add a set of methods Client.add_connected_callback(), Client.add_disconnected_callback(), Client.add_failed_callback() to register callback functions which are called once the respective state is reached.
  • Add a new optional parameter to Client() which could be as simple as auto_reconnect: bool = False.
  • Whenever the auto_reconnect is enabled, an additional task Client._auto_reconnect_task will be created by the client upon connecting, which continously calls Client.check_connection() similiar to how the Client._monitor_server_loop() works, and in case of an error automatically tries connecting the client again.
  • Probably a bit more configuration is required for that feature, so maybe add a dataclass AutoReconnectSettings. The following settings come to mind:
    • How often to try reconnecting before giving up
    • How long to wait in between connection attempts (maybe with exponential backoff?)
    • Whether or not to restart the subscriptions after reestablishing the connection
  • Maybe even allow deeper customization by pulling the reconnection logic into its own class ClientReconnectHandler, which would implement a simple strategy pattern to allow interchangeable reconnection mechanisms, providing a ExponentialBackoffReconnectHandler by default. The parameter could then have the signature of auto_reconnect: bool | ClientReconnectHandler = False, applying a default handler with default values when simply set to True.

I'd love to hear what you guys think about this and how you would approach this. Maybe someone has already implemented a similiar reconnect mechanism and would like to share their thoughts, I'd greatly appreciate that.

Thanks a lot!

@cerlestes
Copy link
Author

cerlestes commented Jun 27, 2024

A small update after a week of trying to work on this:

I've implemented the solution I've stated above in an adapter class of our application code. It worked absolutely fine when I tried it on my Windows machine by cutting the SSH forwarding to the PLC network and reconnecting it; the client would pick up on this and restablish connection just fine. But when we tried running it from a Linux host, where we'd sporadically blackhole the PLC IP address via nftables to simulate packet loss, the application would sometimes end up hanging ad infinitum when trying to connect the same Client instance again after it had lost its connection. I couldn't find the solution to this problem; I don't want to point fingers at somebody else's code, but it seemed like the bug is either somewhere in the UaClient/UASocketProtocol or even deeper within Python. The UaClient calls asyncio.wait_for() when connecting, but this timeout never fires; the function simply hangs forever. Adding another asyncio.wait_for() around it in our application code didn't work either, which sounds to me like the whole thread is getting stuck somewhere deep down the rabbit hole.

But I've found an alternative solution that isn't a downgrade and works perfectly fine: simply recreate the Client instance when reconnecting. So the workflow is as follows:

  • Wait for connection loss using await client._monitor_server_task
  • Try to disconnect the existing instance to clean it up, this will always raise Exceptions as expected
  • Create a new Client instance
  • Connect the new instance
  • Rewrire the subscriptions and everything else to work with the new instance

I think the same workflow could be adapted to the Client class itself, so that it would simply replace its internal UaClient instance when reconnecting.

We'll put this method into 24/7 test for a week or two now and if that succeeds, we'll roll it out to two applications on the field. I'll report back if further adjustments were required to achieve a robust reconnect mechanism.

@oroulet
Copy link
Member

oroulet commented Jul 1, 2024

@cerlestes thanks for a good analysis and also some interesting propositions.
I see we could recreate the ua_client object in a task at connection loose, but then we would also need to remember all the subscriptions that client has (or even worse, check with server if the subscription still exists and try to reconnect them), sounds quite hard to make it work reliably in all cases

btw we had a similar issue at work and I just implemented that one: #1670 maybe that also helps you

@oroulet
Copy link
Member

oroulet commented Jul 1, 2024

but your last proposition is the async way of doing the same as a callback. We just need to make public that monitor_server_task() and it does the same

@cerlestes
Copy link
Author

@oroulet Nice commit, we would have needed that a few weeks ago 🤣 It's basically the same solution that I arrived at, only that my solution was external to the library. So +1 to that feature, looking forward to it releasing soon 👍 Maybe make it an async function though and await it?

Re subscriptions: I don't think it's a lot of work to get them working again; maybe I don't know enough about how they work to see the effects it would have on more complex cases though. In my adapter class, I simply try to delete the subscriptions by their ID once I reconnect, and then recreate them. This works absolutely fine for our use case. Maybe there are use-cases where it'd be preferred to keep the subscriptions if they still exist on the server, so that solution might not be the best way to go ahead, but it'd work. I'd implement the reconnect mechanism on the Client, not on UaClient, so the library would simply need to remember the values that subscriptions were created with in the Client and then it's as easy as calling create_subscription() again with those values after reconnecting.

PS: the reconnect mechanism I've described above has been running all weekend now, experiencing a controlled connection loss about every 3 minutes and reconnecting successfully afterwards for almost 1500 times already. So it seems to be working fine and stable.

@oroulet
Copy link
Member

oroulet commented Jul 1, 2024

I made it async now

@marcvanR
Copy link

Hi,

First of all, thanks a lot for this library, it helps us a lot!

This discussions interested me, but now I'm not sure how to implement this properly, is there some documentation somewhere that shows how to implement this??

I'm currently polling the client.check_connection() which raises an error which I catch to recreate the client, but what you describe sounds a bit less blunt.

Also I use the same strategy wrt subscriptions as @cerlestes describes, but now I am curious what the pitfalls are with that approach.

Thanks again!

@AndreasHeine
Copy link
Member

AndreasHeine commented Aug 15, 2024

@oroulet @cerlestes @marcvanR

just for clarification:

https://reference.opcfoundation.org/Core/Part4/v104/docs/6.7
image039

currently the problem is that the transport, session and subscription are to tightly coupled...

async def transfer_subscriptions(self, params: ua.TransferSubscriptionsParameters) -> List[ua.TransferResult]:
'''
https://reference.opcfoundation.org/Core/Part4/v104/5.13.7/
This Service is used to transfer a Subscription and its MonitoredItems from one Session to another.
For example, a Client may need to reopen a Session and then transfer its Subscriptions to that Session.
It may also be used by one Client to take over a Subscription from another Client by transferring the Subscription to its Session.
'''

Transport Reconnect must be independent of Session and Subscription!
One Client can have multiple Sessions and one Subscription is kind bound to One Session but can be transferred!

Transfer-Subscription must be implemented!

async def transfer_subscriptions(self, params: ua.TransferSubscriptionsParameters) -> List[ua.TransferResult]:
# Subscriptions aren't bound to a Session and can be transferred!
# https://reference.opcfoundation.org/Core/Part4/v104/5.13.7/
raise NotImplementedError

what i saw in the source code of ua_client is the crux and it would need to be refactored to enable lousily coupling:

session class with all Services (Read/Write/Browse see "AbstractSession" in session_interface.py) which need to get the client injected and not the other way around so a client can have multiple sessions...

thats a lot of work and i have almost 0 time... i am sorry i started with the AbstractSession stuff but couldn't bring it much further!

edit:

and of course it would break almost every existing client -> 2.0.0

edit2:

the Node-Class need to be removed or reworked as well... almost forgot to mention that

@oroulet
Copy link
Member

oroulet commented Aug 17, 2024

honestly I do not even have the time to try to understand. At work we use the check_connection() and reconnect when there is an issue. It works perfectly for us, and I do not feel the need for anything more.

@AndreasHeine
Copy link
Member

in simple words ua_client needs to be split into a Session-Class (implements session_interface) and a Client-Class (implements session less stuff e.g. Connect / Reconnect of the Transport and SecureChannel related stuff.

so if you interact with a server e.g. Read you use the Session-Class-Instance which got the Client-Class-Instance injected...

currently its structurly not possible to use the client according how spec describes its features... currently you will lose monitored item data thru not reuse the session and not transfer subscription to the new session 🤷

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants