Dealing with sleeping nodes: Engaging contact, assumptions about awake status #1428

AlCalzone · 2021-01-16T23:12:16Z

AlCalzone
Jan 16, 2021
Maintainer

@hanskroner Here's the next question about sleeping nodes...

When the driver needs to send a message to a sleeping node, it currently queues it with a special priority. This priority is higher than most others when the node is awake and lower when the node is asleep. This way we make sure that the messages don't block the queue, but get prioritized when the node wakes up.
When such a message is transmitted but not acknowledged by the node, the node gets marked as asleep and the message stays in the queue instead of the call being rejected.
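
To make the priority scheme concrete, here is a minimal sketch of a status-dependent ordering. It is not the actual zwave-js code: the type names, the `"wakeup"` marker and the numeric ranks are all assumptions chosen for illustration.

```typescript
// Illustrative only: names and numeric ranks are assumptions, not zwave-js internals.
type NodeStatus = "awake" | "asleep";

interface QueuedMessage {
  nodeId: number;
  label: string;
  // A number is a fixed rank (lower is sent first);
  // "wakeup" marks a message addressed to a node that can sleep.
  priority: number | "wakeup";
}

const RANK_NORMAL = 2;
const RANK_POLL = 4;
const RANK_WAKEUP_WHEN_AWAKE = 1; // assumed: ahead of normal traffic
const RANK_WAKEUP_WHEN_ASLEEP = 5; // assumed: behind everything else

function effectiveRank(
  msg: QueuedMessage,
  status: Map<number, NodeStatus>,
): number {
  if (typeof msg.priority === "number") return msg.priority;
  // Nodes with unknown status are treated as asleep
  return status.get(msg.nodeId) === "awake"
    ? RANK_WAKEUP_WHEN_AWAKE
    : RANK_WAKEUP_WHEN_ASLEEP;
}

function sortQueue(
  queue: QueuedMessage[],
  status: Map<number, NodeStatus>,
): QueuedMessage[] {
  // Sort a copy by effective rank, leaving the input array untouched
  return [...queue].sort(
    (a, b) => effectiveRank(a, status) - effectiveRank(b, status),
  );
}
```

Re-sorting whenever a node's status changes moves its messages to the front or the back without removing them from the queue.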

I'm wondering if you have any tips on what else to do when communicating with sleeping nodes. I've tried to be smart and make assumptions about the status (e.g. for a while a node was thought to be asleep 10 seconds after the last non-unsolicited contact), but that backfired.

hanskroner · 2021-01-17T09:57:07Z

hanskroner
Jan 17, 2021

My suggestion here is along the lines of what I'm suggesting for most other questions: get the basics rock-solid first, then move on to optimizations. I would suggest not giving in to pressure to match the performance of existing libraries right out of the gate - they've been around for much longer and have had time to get to where they are.

It sounds like zwave-js is trying to track the awake/asleep state of the Non-Listening (NL) nodes. By now, you'll likely have pieced together from all my other suggestions that I'm going to recommend against this. Z-Wave requires that NL nodes send the Wake-Up Notification to one, and only one, controller. This simplifies the task for the controller handling NL nodes, since it can assume it will be the only node in the network sending data to the NL node. I would suggest that instead of trying to make assumptions about the listening state of these devices, you control it.

The simplest way to do this is to park outbound messages to NL nodes in a separate queue. Conceptually, one for each NL node. These queues are not meant to store all messages, they need to be clever enough to not store duplicates and to also realize that some messages have an expiration date. Security commands are particularly challenging - queueing NONCE Get commands is not particularly useful and queuing NONCE Reports is totally useless. The library's outgoing machinery needs to support the encapsulation of commands just before transmitting them to ensure the state of the PRNG is what the receiver expects and that the NONCEs are fresh. It also needs to be ready to deal with transmission errors that might generate additional transmissions not in the queue - for example, the SPAN for the receiver node being out of sync, thus necessitating the request of a fresh NONCE before being able to re-transmit.
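
As a rough sketch of such a per-node parking queue, the snippet below drops superseded duplicates and discards expired commands when the queue is drained. All names (`ParkedCommand`, `ParkingQueues`, the `key` field) are hypothetical. Commands are stored unencapsulated on the assumption that security encapsulation, including any NONCE exchange, happens at transmission time and is never parked.

```typescript
// Hypothetical types; commands are parked in plain form and
// would be encapsulated only at the moment they are transmitted.
interface ParkedCommand {
  key: string; // identifies "the same" command, e.g. "Configuration.Set:3"
  payload: string; // stand-in for the unencapsulated command
  expiresAt?: number; // ms timestamp; undefined means it never expires
}

class ParkingQueues {
  private queues = new Map<number, ParkedCommand[]>();

  park(nodeId: number, cmd: ParkedCommand): void {
    const queue = this.queues.get(nodeId) ?? [];
    // A newer command with the same key supersedes the older one
    const deduped = queue.filter((c) => c.key !== cmd.key);
    deduped.push(cmd);
    this.queues.set(nodeId, deduped);
  }

  // Returns the commands still valid at `now` and empties the node's queue
  drain(nodeId: number, now: number): ParkedCommand[] {
    const queue = this.queues.get(nodeId) ?? [];
    this.queues.delete(nodeId);
    return queue.filter(
      (c) => c.expiresAt === undefined || c.expiresAt > now,
    );
  }
}
```

Keeping only the latest command per key is one possible definition of "duplicate"; a real library would need a per-command-class notion of which commands supersede each other.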

The library maintains this queue until it receives a Wake-Up Notification from the NL node. It then works its way through the queue, transmitting the necessary messages to the node until the queue is empty. Once the queue is empty, the library uses the Wake-Up No More Information command to inform the NL node that there is nothing more for it to receive - which instructs it to go back to sleep. There is an exception to this during the inclusion process, where the NL node will be awake but the controller won't receive a Wake-Up Notification. This needs to be handled so that the post-inclusion interview and setup happens as planned, instead of getting queued. The transmission timeouts for commands in the queue will also need to be in line with the recommended "stay awake" times for NL nodes, so that the NL node doesn't time out waiting for messages and go back to sleep before the controller's queue is empty.
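
The flow on receiving a Wake-Up Notification could look roughly like the sketch below. `SendFn`, `handleWakeUp` and the string command names are made up for illustration, and the transport is synchronous only to keep the example short; a real driver would await each transmission and apply the timeouts discussed above.

```typescript
// Hypothetical transport: returns true if the node ACKed the frame.
type SendFn = (nodeId: number, command: string) => boolean;

const NO_MORE_INFORMATION = "WakeUpNoMoreInformation";

// Called when a Wake-Up Notification arrives from `nodeId`.
// Returns the commands that are still parked afterwards.
function handleWakeUp(
  nodeId: number,
  parked: string[],
  send: SendFn,
): string[] {
  const remaining = parked.slice();
  let next = remaining[0];
  while (next !== undefined) {
    // A failed transmission means the node is gone again:
    // keep this command and everything behind it for the next wake-up.
    if (!send(nodeId, next)) return remaining;
    remaining.shift();
    next = remaining[0];
  }
  // Queue is empty: tell the node it can go back to sleep
  send(nodeId, NO_MORE_INFORMATION);
  return remaining;
}
```

The same function could be called directly after inclusion, where the node is awake without having sent a Wake-Up Notification.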

Something similar to the description above gives you a solid starting point, and avoids the need for tracking yet another piece of state that the library can't really rely on. There are likely other corner cases to be handled and optimizations to be made. The "classic" one is allowing a command entering the "sleeping" queue to "peek" by being transmitted - if it's ACK'd then the queue is processed until empty or a transmission fails. The library of course needs to be smart and not peek every single command going into the queue, and it needs to accept the "dead air" time while waiting for the message to be ACK'd or time out. Typically, this is one of the first "tweaks" made in order to support some misbehaving devices that send NIFs or other frames when woken up manually, instead of Wake-Up Notifications. It's not ideal, but it's much better than the "common-but-bad"™ alternative of interpreting NIFs as Wake-Up Notifications.
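
A possible shape for the "peek" tweak is sketched below. The throttle (`minPeekIntervalMs`) is just one assumed way of not peeking on every enqueue, and `Transmit` and `PeekingQueue` are hypothetical names. As before, the transport is synchronous for brevity.

```typescript
// Hypothetical transport: returns true if the node ACKed the frame.
type Transmit = (nodeId: number, command: string) => boolean;

class PeekingQueue {
  private parked: string[] = [];
  private lastPeekAt = -Infinity;
  private readonly nodeId: number;
  private readonly transmit: Transmit;
  private readonly minPeekIntervalMs: number; // assumed throttle between peeks

  constructor(nodeId: number, transmit: Transmit, minPeekIntervalMs: number) {
    this.nodeId = nodeId;
    this.transmit = transmit;
    this.minPeekIntervalMs = minPeekIntervalMs;
  }

  enqueue(command: string, now: number): void {
    this.parked.push(command);
    // Skip the peek if one was attempted recently
    if (now - this.lastPeekAt < this.minPeekIntervalMs) return;
    this.lastPeekAt = now;
    this.flush();
  }

  // Sends from the front until the queue is empty or a transmission fails
  flush(): void {
    let next = this.parked[0];
    while (next !== undefined && this.transmit(this.nodeId, next)) {
      this.parked.shift();
      next = this.parked[0];
    }
  }

  get pending(): readonly string[] {
    return this.parked;
  }
}
```

In this sketch a failed peek leaves the command parked, so it is picked up again by `flush()` on the next Wake-Up Notification or by the next successful peek.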

2 replies

AlCalzone Jan 17, 2021
Maintainer Author

It sounds like zwave-js is trying to track the awake/asleep state of the Non-Listening (NL) nodes. By now, you'll likely have pieced together from all my other suggestions that I'm going to recommend against this.

In that case, do you have thoughts on how to communicate with users that their command will likely not be handled immediately? Displaying the awake status seems like a natural way to do that.

The simplest way to do this is to park outbound messages to NL nodes in a separate queue. Conceptually, one for each NL node

zwave-js does that, except it uses a single queue and gives messages to NL nodes a special priority which achieves the desired behavior.

Once the queue is empty, the library uses the Wake-Up No More Information command to inform the NL node that there is nothing more for it to receive - which instructs it to go back to sleep. There is an exception to this during the inclusion process, where the NL node will be awake but the controller won't receive a Wake-Up Notification. This needs to be handled so that the post-inclusion interview and setup happens as planned, instead of getting queued.

That's already in place ✔

The transmission timeouts for commands in the queue will also need to be in line with the recommended "stay awake" times for NL nodes, so that the NL node doesn't time out waiting for messages and go back to sleep before the controller's queue is empty.

This, however, is currently a problem. The reply timeout (#1416) of 10 seconds coincides with the recommended awake timer for NL nodes. This means that if we are waiting for a response that will never come, the node has likely gone back to sleep by the time we try to send the next command. ❌

These queues are not meant to store all messages, they need to be clever enough to not store duplicates

That is not entirely taken care of. We try to avoid some duplicates and avoid queueing pings, but we could certainly be smarter here. 🔸

to also realize that some messages have an expiration date.

Done ✔

Security commands are particularly challenging - queueing NONCE Get commands is not particularly useful and queuing NONCE Reports is totally useless.

I'm actually not sure here - this might partially be done. 🔸

The library's outgoing machinery needs to support the encapsulation of commands just before transmitting them to ensure the state of the PRNG is what the receiver expects and that the NONCEs are fresh. It also needs to be ready to deal with transmission errors that might generate additional transmissions not in the queue - for example, the SPAN for the receiver node being out of sync, thus necessitating the request of a fresh NONCE before being able to re-transmit.

These scenarios are supported because they were already necessary for S0 (I call them "handshake" transactions). ✔
Granted, the code could be a bit more elegant (I have a couple things in mind), but it gets the job done.

The "classic" one is allowing a command entering the "sleeping" queue to "peek" by being transmitted - if it's ACK'd then the queue is processed until empty or a transmission fails. The library of course needs to be smart and not peek every single command going into the queue

I was thinking of something along these lines - basically ping a non-listening device when the first message gets added to its queue and we're not sure that it is awake (i.e. not currently communicating with it).

hanskroner Jan 17, 2021

In that case, do you have thoughts on how to communicate with users that their command will likely not be handled immediately? Displaying the awake status seems like a natural way to do that.

Unless I’m not understanding your question, that’s a problem for the library’s clients to figure out. They already know the device is Non-Listening, and that’s enough for the scheme I described, where the NL devices sleep 99% of the time.
