-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zha "EZSP is not running" error #107798
Comments
Hey there @dmulcahey, @Adminiuga, @puddly, @TheJulianJES, mind taking a look at this issue as it has been labeled with an integration ( Code owner commandsCode owners of
(message by CodeOwnersMention) zha documentation |
If you're able to reproduce the crash, can you enable ZHA debug logging and post the whole log? |
Absolutely @puddly. This is the log that the snippet came from with debug logging enabled |
Is this related to #106185? Seems there's a problem with zha and ezsp? |
There's nearly a nine second break in your log during which radio communication (and all other activity within HA's event loop) completely stopped. EZSP radios, like your Sonoff stick, are extremely timing sensitive. During that period, ZHA is not able to acknowledge data coming in from your radio and the radio resets. ZHA is already running this all in a separate thread for latency purposes but it seems like this stall is system-wide. Some integration or external process is causing this stutter. So that we can figure out what's responsible, can you list what integrations and addons are you running? |
Sure! Add-ons: Integrations: |
So the radio resets because it isn't acknowledged and assumes a comms error? How come there's no recovery mechanism on that? My experience is when it is in this error case, it is just forever gone. I can only recover by reboot of the whole pi (triggered from the os) |
Are you able to share with me the timestamp where this lock up is occurring |
There is error recovery. The problem is that it fails to set up during recovery. It'll be retried indefinitely until it works but the underlying problem remains. I've annotated a portion of your log, hopefully it makes sense: # Normal radio traffic. Notice how sending a command immediately results in data being sent 0.001s later.
2024-01-09 14:09:20.679 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getValue: (<EzspValueId.VALUE_FREE_BUFFERS: 3>,)
2024-01-09 14:09:20.680 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'31c021a9fe2a16767f7e'
2024-01-09 14:09:20.688 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'14c0a1a9fe2a15b3ae10647e'
2024-01-09 14:09:20.689 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:09:20.692 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getValue: [<EzspStatus.SUCCESS: 0>, b'\xf7']
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] Free buffers status EzspStatus.SUCCESS, value: 247
2024-01-09 14:09:20.693 DEBUG (MainThread) [bellows.zigbee.application] ezsp_counters: [MAC_RX_BROADCAST = 58, MAC_TX_BROADCAST = 42, MAC_RX_UNICAST = 26, MAC_TX_UNICAST_SUCCESS = 12, MAC_TX_UNICAST_RETRY = 0, MAC_TX_UNICAST_FAILED = 0, APS_DATA_RX_BROADCAST = 1, APS_DATA_TX_BROADCAST = 1, APS_DATA_RX_UNICAST = 11, APS_DATA_TX_UNICAST_SUCCESS = 7, APS_DATA_TX_UNICAST_RETRY = 32, APS_DATA_TX_UNICAST_FAILED = 16, ROUTE_DISCOVERY_INITIATED = 9, NEIGHBOR_ADDED = 2, NEIGHBOR_REMOVED = 0, NEIGHBOR_STALE = 0, JOIN_INDICATION = 0, CHILD_REMOVED = 0, ASH_OVERFLOW_ERROR = 0, ASH_FRAMING_ERROR = 0, ASH_OVERRUN_ERROR = 0, NWK_FRAME_COUNTER_FAILURE = 0, APS_FRAME_COUNTER_FAILURE = 0, UTILITY = 0, APS_LINK_KEY_NOT_AUTHORIZED = 0, NWK_DECRYPTION_FAILURE = 0, APS_DECRYPTION_FAILURE = 0, ALLOCATE_PACKET_BUFFER_FAILURE = 0, RELAYED_UNICAST = 0, PHY_TO_MAC_QUEUE_LIMIT_REACHED = 0, PACKET_VALIDATE_LIBRARY_DROPPED_COUNT = 0, TYPE_NWK_RETRY_OVERFLOW = 0, PHY_CCA_FAIL_COUNT = 0, BROADCAST_TABLE_FULL = 0, PTA_LO_PRI_REQUESTED = 0, PTA_HI_PRI_REQUESTED = 0, PTA_LO_PRI_DENIED = 0, PTA_HI_PRI_DENIED = 0, PTA_LO_PRI_TX_ABORTED = 0, PTA_HI_PRI_TX_ABORTED = 0, ADDRESS_CONFLICT_SENT = 0, EZSP_FREE_BUFFERS = 247]
# pychromecast seems to start having issues
2024-01-09 14:09:31.229 ERROR (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error reading from socket.
2024-01-09 14:09:31.230 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Error communicating with socket, resetting connection
# We feed the watchdog to make sure the radio is still alive
2024-01-09 14:09:31.261 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-01-09 14:09:31.261 DEBUG (MainThread) [bellows.ezsp.protocol] Send command readCounters: ()
# Nine seconds later (!), not a single byte has been sent
2024-01-09 14:10:40.065 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:10:40.068 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection
# Finally, data starts being sent and received in one massive block, meaning it has been queuing up in an OS buffer for a long time
2024-01-09 14:10:40.084 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'42c121a9a52a73fd7e'
2024-01-09 14:10:40.095 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'24c0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778e0a037e'
2024-01-09 14:10:40.096 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.102 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.103 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.105 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.106 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:10:40.108 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'2cc0b1a9112a15b6588d4a24ab1593499cd1bf69b4469874f1c71588fc3f2eade9ccee778ed8bc7e'
2024-01-09 14:10:40.109 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# The radio says that it is resetting because nothing has responded to it in many seconds
2024-01-09 14:10:40.110 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.111 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'34c0b1a9112a15b658914e24ab5593499ceebf69b4469874f5de1483fc7e1ec1f3f7a97e'
2024-01-09 14:10:40.113 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'8430fc7e'
2024-01-09 14:10:40.120 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.124 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.125 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.127 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.131 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
2024-01-09 14:10:40.135 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e'
...
2024-01-09 14:10:40.976 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: <NcpResetCode.ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT: 81>
...
# ZHA is restarting again a few seconds later
2024-01-09 14:10:50.830 DEBUG (bellows.thread_0) [zigpy.serial] Opening a serial connection to '/dev/serial/by-id/usb-ITEAD_SONOFF_Zigbee_3.0_USB_Dongle_Plus_V2_20231122145443-if00' (115200 baudrate)
...
2024-01-09 14:10:54.778 DEBUG (MainThread) [bellows.ezsp.protocol] Send command getCurrentSecurityState: ()
2024-01-09 14:10:54.779 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'016d21a93d2ae7007e'
2024-01-09 14:10:54.786 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'116da1a93d2a154659ccf3b754aa2112d0c8a47e'
2024-01-09 14:10:54.787 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'82503a7e'
2024-01-09 14:10:54.789 DEBUG (MainThread) [bellows.ezsp.protocol] Application frame received getCurrentSecurityState: [<EmberStatus.SUCCESS: 0>, EmberCurrentSecurityState(bitmask=<EmberCurrentSecurityBitmask.GLOBAL_LINK_KEY|HAVE_TRUST_CENTER_LINK_KEY|TRUST_CENTER_USES_HASHED_LINK_KEY|96: 244>, trustCenterLongAddress=4c:5b:b3:ff:fe:92:b9:58)]
# Same problem again: 32 (!) second break in communication with the radio
2024-01-09 14:11:28.988 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'216db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc0fff7e'
2024-01-09 14:11:28.989 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:28.992 DEBUG (MainThread) [bellows.ezsp.protocol] Send command setPolicy: (<EzspPolicyId.TRUST_CENTER_POLICY: 0>, <EzspDecisionId.ALLOW_PRECONFIGURED_KEY_JOINS: 1>)
2024-01-09 14:11:29.001 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.002 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# pychromecast having issues again
2024-01-09 14:11:29.006 WARNING (Thread-9) [pychromecast.socket_client] [Kitchen Assistant(192.168.0.107):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.007 WARNING (Thread-7) [pychromecast.socket_client] [Living Room TV(192.168.0.98):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.011 WARNING (Thread-8) [pychromecast.socket_client] [Bedroom speaker(192.168.0.178):8009] Heartbeat timeout, resetting connection
2024-01-09 14:11:29.013 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.014 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Data frame: b'296db1a9112a15b658914e24ab5593499cef876fb4469874f5de1b83fc7e1e4efc219b7e'
2024-01-09 14:11:29.017 DEBUG (bellows.thread_0) [bellows.uart] Sending: b'83401b7e'
# The radio resets again
2024-01-09 14:11:29.041 DEBUG (bellows.thread_0) [bellows.uart] Error code: ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT, Version: 2, frame: b'c20251a8bd7e' |
having same issue with but with skyconnect (multiprotocol enabled) it only recovers if full reboot is done |
@meramsey Multiprotocol issues aren't something Core has any influence on. Please upload your logs here: home-assistant/addons#3408 |
I have disabled any integrations I don't actively require right now (cast being one of them). Is cast always having issues around the error? Although I'm realising the two cases you pointed out, the first cast has already errored before the issue, the second it errors after the issue |
I'm experiencing the same issue. It started somewhere in the middle of December. |
I did read that the wifi bridge is know for issues due to dropped wifi packets. This issue is with a serial connection to dongle. That's why there shouldn't be an issue |
@puddly I have disabled all integrations I am not using: Error still occured. |
Issues for me as well, all zigbee devices went offline at night. I see in the logs: |
For what it is worth : I had the same issue with Homeassistant 2024.3.1 running on a Synology inside a container created from homeassistant/home-assistant:stable with Sonoff Zigbee 3.0 Plus-E Dongle. I resolved the problem by downgrading to 2024.2.3. |
Hi I use the Skyconnect zigbee stick and I have the same issues. It started with crashing a couple of months ago after a home assistant upgrade (can be coincidence). now it happens multiple times per day. I'm desperate. I have search for days to find a solution on home assistant forum, github, reddit etc. no fix so far. @puddly You seem to have a lot of knowledge about this. Is it ok if I share my log with you as well? |
I guess similar as zigpy/bellows#617 (which is not Home Assistant) |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
The problem
Zha regularly drops to "initialising" state while using a Sonoff ZBDonlge E. This causes all of my zigbee devices to become unavailable and unusable. The only recovery I have found is reboot the home assistant setup
What version of Home Assistant Core has the issue?
2024.1.2
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
Zigbee Home Automation
Link to integration documentation on our website
https://www.home-assistant.io/integrations/zha/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
2024-01-09 14:16:49.719 DEBUG (MainThread) [zigpy.application] Feeding watchdog 2024-01-09 14:16:49.720 DEBUG (MainThread) [bellows.ezsp] Couldn't send command readCounters(()). EZSP is not running 2024-01-09 14:16:49.720 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: EzspError('EZSP is not running') 2024-01-09 14:16:49.721 WARNING (MainThread) [zigpy.application] Watchdog failure Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 665, in _watchdog_loop await self.watchdog_feed() File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 647, in watchdog_feed await self._watchdog_feed() File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 999, in _watchdog_feed (res,) = await self._ezsp.readCounters() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 212, in _command raise EzspError("EZSP is not running") bellows.exception.EzspError: EZSP is not running 2024-01-09 14:16:49.739 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: EzspError('EZSP is not running') 2024-01-09 14:16:49.745 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
Additional information
Unknown if ever working. I have not had this ever working. I am aware that the sonoff P dongle is reported to have issues but the newer E dongle uses a whole new chip and stack. I have seen reports of the ZBBridge having issues.
The setup is a RPI 3B+ with default RPI PSU. I have tested the dongle connected directly to the serial bus and through a 1 m extension. The RPI is using wifi to connect to the network.
The text was updated successfully, but these errors were encountered: