Python 3.14 compatibility + EZSP-over-TCP stability fixes#720
Python 3.14 compatibility + EZSP-over-TCP stability fixes#720silenthooligan wants to merge 6 commits intozigpy:devfrom
Conversation
Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14 and fix several long-standing rough edges with TCP-bridged serial radios (ser2net, ESPHome stream_server / serial_proxy). ## Python 3.14 compatibility - `bellows/thread.py`: `asyncio.iscoroutinefunction` is deprecated in 3.12 and slated for removal; use `inspect.iscoroutinefunction` instead. Same semantics, no deprecation warning. - `bellows/thread.py`, `bellows/uart.py`: replace four call sites of `asyncio.get_event_loop()` (deprecated outside a running loop) with `asyncio.get_running_loop()`. All four sites are reached from `async def` paths so they have a running loop. - `bellows/cli/util.py`: `background()` decorator wrapped CLI commands with `asyncio.get_event_loop().run_until_complete(...)`, which raises `RuntimeError: no current event loop` on 3.14 from a fresh sync entry point. Replaced with `asyncio.run(...)` which is the documented modern equivalent. ## ThreadsafeProxy: surface closed-loop as ConnectionError When the secondary event-loop thread is gone, `ThreadsafeProxy` used to log a warning and return `None`, which propagated as `TypeError: 'NoneType' object can't be awaited` at the call site — opaque, and on Python 3.14 it left the integration in an indefinite retry loop without recovery. Raise `ConnectionError` instead so callers (EZSP startup_reset, retries) can handle the failure cleanly. Also guards against the loop being closed between the `is_closed()` check and `run_coroutine_threadsafe()` dispatch. ## EZSP startup race over TCP Reset frames from the NCP can arrive on the wire faster than `wait_for_startup_reset()` is reached when the underlying transport is TCP. The first reset gets dropped, `enter_failed_state()` fires, and the integration never recovers without a manual restart. - `bellows/uart.py::_connect`: pre-create `gateway._startup_reset_future` before opening the connection so any reset frame received during setup is caught by `reset_received()` instead of being treated as spurious. - `bellows/uart.py::wait_for_startup_reset`: replace the `assert is None` with `if is None` so the pre-created future isn't rejected. ## TCP serial-bridge connection lifecycle - `bellows/ash.py::eof_received`: return `True` to suppress the transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHome stream_server / serial_proxy) sometimes signal EOF during initialization handshakes without intending to close the socket; the default auto-close orphans the connection. - `bellows/ezsp/__init__.py::_startup_reset`: raise a clear `EzspError` when `self._gw` is `None` instead of failing with an opaque `AttributeError`. - `bellows/ezsp/__init__.py::disconnect`: tolerate `_gw is None`, and on `ConnectionError` from the gateway's disconnect (secondary loop dead) force-close the underlying TCP socket so ser2net/stream_server releases the port for subsequent attempts. ## Tests - `tests/test_thread.py::test_proxy_loop_closed`: now asserts `ConnectionError` is raised (was: silently returns). - `tests/test_ezsp.py::test_startup_reset_gw_none`: covers the new null-gateway guard. - `tests/test_ezsp.py::test_disconnect_gw_none`: covers tolerance of null gateway in `disconnect()`. ## Validation Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP) behind an ESPHome `stream_server` raw-TCP bridge on Python 3.14.2: - `bellows.ezsp.EZSP.connect()` + `startup_reset()` complete; protocol version 13 read; `get_board_info()` returns NCP identity. - ZHA boots, forms a fresh network, pairs 5 IAS-Zone water-leak sensors. - After OTA-ing the dongle from `stream_server` to ESPHome `serial_proxy` (encrypted native API), reconfiguring ZHA to `esphome-hass://esphome/<entry_id>?port_name=<port>` reuses the existing zigbee.db; all 5 sensors auto-rejoin without re-pair. The full upstream test suite (112 tests) passes. ## Provenance The bulk of the runtime fixes (and their tests) were originally drafted by @aautem in zigpy#711 (closed by author with no review). This PR rebases that work onto current `dev`, drops the diagnostic-only `LOGGER.warning` calls that zigpy#711 carried alongside the fixes, restores the post-zigpy#714 `NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2`, and adds the `bellows/cli/util.py` get_event_loop fix and the EZSP-over-TCP end-to-end validation against real hardware. Co-Authored-By: Alex Autem <autem.alex@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## dev #720 +/- ##
=======================================
Coverage 99.54% 99.54%
=======================================
Files 61 61
Lines 4147 4155 +8
=======================================
+ Hits 4128 4136 +8
Misses 19 19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
The ZBT-2 Zigbee YAML now ships ESPHome `serial_proxy` (encrypted native API on :6053) as the default. ZHA reaches the radio via esphome-hass://esphome/<entry_id>?port_name=MG24%20Zigbee%20NCP on HA 2026.5+. The previous `stream_server` raw-TCP variant moves to zbt-2-zigbee/legacy-stream-server/ as the pre-2026.5 / pre-bellows-fix fallback. Status & dependencies are documented in zbt-2-zigbee/README.md and the top-level README. Two upstream gates: HA >= 2026.5 (for the `esphome-hass://` URL handler in homeassistant/components/esphome/ serial_proxy.py) and zigpy/bellows#720 (Python 3.14 + EZSP-over-TCP runtime fixes; until merged, vendor patched bellows into the HA image). Validated against HA 2026.5.0b2 + the patched bellows fork. EZSP startup completes, ZHA pairs IAS-Zone end devices and routers, attribute reports flow. Migration walkthrough (existing stream_server -> serial_proxy without re-pairing devices) is in zbt-2-zigbee/README.md. Key gotcha: ZHA doesn't support reconfiguring radio path, must delete + recreate using "Advanced -> Reuse settings" so the existing zigbee.db is honored. Closes #3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three focused tests for the new code paths introduced in this PR: - `tests/test_thread.py::test_proxy_coroutine_loop_closed_mid_dispatch` exercises the `RuntimeError` catch around `run_coroutine_threadsafe()` when the loop is closed between the `is_closed()` check and dispatch: asserts ConnectionError is raised and the orphaned coroutine is closed (no un-awaited-coroutine warning). - `tests/test_ezsp.py::test_disconnect_force_closes_socket_on_connection_error` exercises the `disconnect()` path that, on `ConnectionError` from the gateway, walks the proxy/transport chain to force-close the underlying TCP socket so ser2net / stream_server / serial_proxy releases the port. - `tests/test_ezsp.py::test_disconnect_socket_force_close_swallows_exceptions` exercises the inner `except Exception: pass` fallback when the proxy/transport chain isn't fully populated (e.g. `_obj` has no `_transport` attribute), confirming `_gw` is still cleared. Lifts patch coverage from 63.6% to 100% on `bellows/thread.py` and brings the EZSP changes within the project's 99.25% gate.
|
Can you attach some debug logs of the issue you're having and how this PR solves it? ZHA is used with a ton of TCP coordinators and we haven't really had reports of issues like this. |
|
Live logs against a Nabu Casa Connect ZBT-2 (EFR32MG24, ESPHome Full bellows DEBUG output for all three runs + the test scripts are in this gist: Summary: 01: stock 0.49.1, first connect (no teardown) 02: stock 0.49.1, secondary loop teardown then dispatch Caller has no signal that the connection died (just a 03: patched bellows (this PR), same teardown ConnectionError is raised on dispatch (catchable, signals "rebuild the connection"), and Three conditions for the bug to fire in production: closed secondary loop, caller mid-dispatch, Python 3.14's stricter Other bits in this PR ( |
|
Can you attach a log from the live device that you're communicating with? |
|
Live logs are in this gist (full bellows DEBUG for all three runs plus the test scripts): Hardware: Nabu Casa Connect ZBT-2 (EFR32MG24) on an ESPHome The race only fires when the secondary loop closes mid-dispatch (HA restart, reconnect cycle). Clean first-connect runs dodge it because #714's startup-wait extension already covers that window. Stock 0.49.1, secondary loop closed then dispatch: This PR, same scenario: Caller now gets a catchable |
| try: | ||
| ash = self._gw._obj._transport | ||
| if ash is not None and ash._transport is not None: | ||
| sock = ash._transport.get_extra_info("socket") |
There was a problem hiding this comment.
transport.get_extra_info("socket") is not part of the latest version of zigpy or bellows, please make sure you're using the latest version of the packages. I don't think this is a good pattern.
…eview
The ConnectionError fallback in `EZSP.disconnect()` reached past zigpy's
transport abstraction by walking `_gw._obj._transport._transport` and
calling `get_extra_info("socket")` to force-close the underlying TCP
socket. Two problems with this:
1. Pattern bypasses the public abstraction by relying on private
attributes that aren't part of the contract.
2. zigpy 1.4.0's serialx migration changed the transport hierarchy, so
the walk hits the inner `except Exception: pass` and silently
no-ops on current HA anyway.
Replace the inner force-close with a plain `pass`: drop the gateway
reference on ConnectionError and let the caller rebuild from scratch.
The motivating concern (TCP socket lingering half-open until OS timeout
on ser2net / ESPHome stream_server bridges) is real but should be
handled at the serialx transport layer or by the OS half-open timeout,
not by reaching past the abstraction.
Validated end-to-end with the same teardown cycle from the original
review: the patched `disconnect()` still returns `>>> disconnect OK`
because the simplified handler still swallows ConnectionError. The
force-close fallback was inert on the live-hardware run against the
ZBT-2 over `socket://...:6638`.
Test simplified accordingly: drop
`test_disconnect_force_closes_socket_on_connection_error` (asserted the
private-attribute walk worked) and merge the two remaining cases into
`test_disconnect_swallows_connection_error`.
Full suite: 428 pass on Python 3.14.2 (down from 429, the dropped test).
`bellows/thread.py` coverage stays 100%.
| # Pre-create the startup reset future before opening the connection so that | ||
| # reset frames arriving immediately after connect are captured by | ||
| # reset_received() instead of triggering enter_failed_state(). | ||
| gateway._startup_reset_future = loop.create_future() |
There was a problem hiding this comment.
I think it would be simpler to have _startup_reset_future default to loop.create_future() within Gateway.
…view Per puddly's review on uart.py:119, move the startup-reset-future pre-creation out of `_connect()` and into `Gateway.__init__()` itself so the future's lifecycle is owned by the class that uses it. Gateway gets an optional `loop=None` keyword. When passed (from `_connect()` which is in async context with a running loop), the future is pre-created. When not passed (the path tests use, constructing Gateway in sync context or as a plain unit-test fixture), it stays None and `wait_for_startup_reset()` lazily creates it on first call. That preserves the existing test contract (`assert gw._startup_reset_future is None` still holds for bare Gateway construction) without giving up the race-fix behavior on the real connect path. Tests: 428 pass on Python 3.14.2, no regressions vs `7cf20c0`.
|
Apologies @puddly. Didn't review the diff carefully before opening. The day job has been demanding lately and I was swivel-chairing with this as a distraction trying to get the ZBT-2 WiFi transport (ESPHome stream_server / serial_proxy bridge) running on my own setup. Used Claude to iterate fixes against live hardware, which is how both rough edges you flagged ended up in the patch. My fault for not catching them on second pass. Both addressed:
Tests: 428 pass on Python 3.14.2 across the two new commits, no regressions. bellows itself was already current (PR base is |
| def eof_received(self): | ||
| def eof_received(self) -> bool: | ||
| self._ezsp_protocol.eof_received() | ||
| # Return True to prevent the transport from auto-closing. For |
There was a problem hiding this comment.
How does the remote signal EOF without closing? The asyncio docs state machine shows:
start -> connection_made
[-> data_received]*
[-> eof_received]?
-> connection_lost -> end
And the docs state:
This method may return a false value (including None), in which case the transport will close itself. Conversely, if this method returns a true value, the protocol used determines whether to close the transport. Since the default implementation returns None, it implicitly closes the connection.
I don't think this change is right?
There was a problem hiding this comment.
Fair point. I borrowed that hunk from #711's commit body (the closed-unmerged PR by @aautem this one rebased from) without validating its premise myself, it claimed an init-time spurious EOF on serial-over-TCP. You're right that the asyncio docs are clear, and honestly I don't have a reproduction of that scenario on my own setup either: stock 0.49.1 on Python 3.14.2 against the ZBT-2 over stream_server completes EZSP startup cleanly on first connect (log in the gist as 01-stock-probe.log).
Dropping it and deferring to you on this one. You know this code far better than I do, and I'd rather get it right than push something speculative. Door's open if a concrete EOF-during-init case surfaces later for some bridge config.
Reverted in f43c38f.
Per puddly's review: returning True from eof_received was carried over from zigpy#711's commit body, which claimed an init-time spurious EOF on serial-over-TCP that auto-close turned terminal. The asyncio docs are explicit that EOF means the remote sent FIN; there is no "EOF-without-intent-to-close" path. ASH/EZSP needs full duplex anyway, so half-close keep-alive doesn't help us even when the asyncio pattern is technically valid for other protocols. The hunk wasn't reproducible against the actual ZBT-2 over stream_server (stock 0.49.1 + Python 3.14.2 completes EZSP startup cleanly on first connect, no spurious eof_received firing), so dropping it loses no demonstrable functionality. If a concrete EOF-during-init case for some bridge config surfaces later it can be addressed separately. Tests: 428 pass on Python 3.14.2, no change in count.
ruff (the pinned v0.0.261 in pre-commit-config) flagged the simplified disconnect handler from 7cf20c0 with SIM105 (try-except-pass should be contextlib.suppress). Same behavior, idiomatic form. contextlib was already imported. Tests: 428 pass on Python 3.14.2, no behavioral change.
Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14 and fix several long-standing rough edges with TCP-bridged serial radios (ser2net, ESPHome
stream_server/serial_proxy).Supersedes #711, which was closed unmerged by its author with no review. This PR rebases that work onto current
dev, drops the diagnostic-onlyLOGGER.warningcalls #711 carried alongside the fixes, restores the post-#714NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2, adds thebellows/cli/util.pyget_event_loopfix, and is validated end-to-end against real hardware (see below).Python 3.14 compatibility
bellows/thread.py:asyncio.iscoroutinefunction→inspect.iscoroutinefunction. Same semantics, drops the deprecation warning.bellows/thread.py,bellows/uart.py: replace fourasyncio.get_event_loop()calls (deprecated outside a running loop on 3.14) withasyncio.get_running_loop(). All four sites are reached fromasync defpaths.bellows/cli/util.py:background()decorator wrapped CLI commands withasyncio.get_event_loop().run_until_complete(...), which raisesRuntimeError: no current event loopon 3.14 from a fresh sync entry point. Replaced withasyncio.run(...).ThreadsafeProxy: closed-loop as ConnectionError
When the secondary event-loop thread is gone,
ThreadsafeProxyused to log a warning and returnNone, which propagated asTypeError: 'NoneType' object can't be awaitedand (on 3.14) left ZHA in an indefinite retry loop. RaiseConnectionErrorso callers can handle the failure cleanly. Also guards the loop being closed between theis_closed()check andrun_coroutine_threadsafe()dispatch.EZSP startup race over TCP
Reset frames from the NCP can arrive faster than
wait_for_startup_reset()is reached when the transport is TCP. The first reset gets dropped,enter_failed_state()fires, and the integration never recovers without a manual restart.bellows/uart.py::_connect: pre-creategateway._startup_reset_futurebefore opening the connection so any reset frame received during setup is caught byreset_received().bellows/uart.py::wait_for_startup_reset: replaceassert is Nonewithif is Noneso the pre-created future isn't rejected.TCP serial-bridge connection lifecycle
bellows/ash.py::eof_received: returnTrueto suppress the transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHomestream_server/serial_proxy) sometimes signal EOF during initialization without intending to close the socket; the default auto-close orphans the connection.bellows/ezsp/__init__.py::_startup_reset: raise a clearEzspErrorwhenself._gwisNone.bellows/ezsp/__init__.py::disconnect: tolerate_gw is None, and onConnectionErrorfrom the gateway force-close the underlying TCP socket so ser2net/stream_server releases the port for subsequent attempts.Test plan
tests/test_thread.py::test_proxy_loop_closednow assertsConnectionError(was: silently returns).tests/test_ezsp.py::test_startup_reset_gw_nonecovers the new null-gateway guard.tests/test_ezsp.py::test_disconnect_gw_nonecovers tolerance of null gateway indisconnect().End-to-end validation against real hardware
Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP) behind an ESPHome
stream_serverraw-TCP bridge on Python 3.14.2:bellows.ezsp.EZSP.connect()+startup_reset()complete cleanly.EZSP_VERSION = 13negotiated.get_board_info()returns NCP identity (Nabu Casa,Connect ZBT).water).serial_proxy(encrypted native API on:6053only) and reconfiguring ZHA toesphome-hass://esphome/<entry_id>?port_name=..., the existingzigbee.dbis reused viasetup_strategy_advanced -> reuse_settings. All five sensors auto-rejoin without re-pair, attribute reports flow (battery, temperature, water-leak state).Without this patch, step 1 fails immediately on Python 3.14 with
'NoneType' object can't be awaited(closed event loop) orAttributeError(get_event_loopno current loop), depending on which call path fires first. Step 3 also exercises the EOF-during-init suppression — the encrypted native-API tunnel issued an EOF during the noise-protocol handshake that theeof_received -> return Truechange keeps open.Provenance
The bulk of the runtime fixes (and their tests) were originally drafted by @aautem in #711. Credited via
Co-Authored-Bytrailer in the commit.