Python 3.14 compatibility + EZSP-over-TCP stability fixes #720

Open
silenthooligan wants to merge 6 commits into zigpy:dev from silenthooligan:fix/python-3.14-ezsp-tcp

@silenthooligan silenthooligan commented May 5, 2026

Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14 and fix several long-standing rough edges with TCP-bridged serial radios (ser2net, ESPHome stream_server / serial_proxy).

Supersedes #711, which was closed unmerged by its author with no review. This PR rebases that work onto current dev, drops the diagnostic-only LOGGER.warning calls #711 carried alongside the fixes, restores the post-#714 NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2, adds the bellows/cli/util.py get_event_loop fix, and is validated end-to-end against real hardware (see below).

Python 3.14 compatibility

  • bellows/thread.py: asyncio.iscoroutinefunction → inspect.iscoroutinefunction. Same semantics, drops the deprecation warning.
  • bellows/thread.py, bellows/uart.py: replace four asyncio.get_event_loop() calls (deprecated outside a running loop on 3.14) with asyncio.get_running_loop(). All four sites are reached from async def paths.
  • bellows/cli/util.py: background() decorator wrapped CLI commands with asyncio.get_event_loop().run_until_complete(...), which raises RuntimeError: no current event loop on 3.14 from a fresh sync entry point. Replaced with asyncio.run(...).
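
The three replacements above can be sketched roughly as follows. This is a minimal illustration, not the code in bellows: `background` is a simplified stand-in for the decorator in bellows/cli/util.py, and `probe` is a hypothetical command used only to exercise it.

```python
import asyncio
import functools
import inspect


def background(f):
    """Run an async CLI command to completion from a sync entry point."""

    @functools.wraps(f)
    def inner(*args, **kwargs):
        # asyncio.run() creates a fresh loop, runs the coroutine, then closes
        # the loop, so no "current" loop has to exist beforehand.
        return asyncio.run(f(*args, **kwargs))

    return inner


@background
async def probe():
    # Hypothetical command body. Inside a coroutine a running loop always
    # exists, so get_running_loop() is safe here.
    loop = asyncio.get_running_loop()
    return loop.is_running()


# inspect.iscoroutinefunction() is the non-deprecated check; functools.wraps
# exposes the original coroutine function as __wrapped__.
is_async = inspect.iscoroutinefunction(probe.__wrapped__)
result = probe()
print(is_async, result)
```

Because `asyncio.run()` owns the loop's whole lifetime, the decorated command can be called from a plain synchronous entry point on any supported Python version.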

ThreadsafeProxy: closed-loop as ConnectionError

When the secondary event-loop thread is gone, ThreadsafeProxy used to log a warning and return None, which propagated as TypeError: 'NoneType' object can't be awaited and (on 3.14) left ZHA in an indefinite retry loop. Raise ConnectionError so callers can handle the failure cleanly. Also guards the loop being closed between the is_closed() check and run_coroutine_threadsafe() dispatch.
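
A minimal sketch of that behaviour, under stated assumptions: `dispatch` is a hypothetical helper that condenses what the proxy does for coroutine methods, and `ping` is a placeholder coroutine. The real ThreadsafeProxy is structured differently.

```python
import asyncio

CLOSED_MSG = "Attempted to use a closed event loop, the connection may have been lost"


def dispatch(loop, coro):
    """Schedule `coro` on another thread's loop, or raise ConnectionError.

    Hypothetical helper; in real use it would be called from a coroutine
    running on the caller's own loop.
    """
    if loop.is_closed():
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise ConnectionError(CLOSED_MSG)
    try:
        future = asyncio.run_coroutine_threadsafe(coro, loop)
    except RuntimeError as exc:
        # The loop was closed between the is_closed() check and the dispatch.
        coro.close()
        raise ConnectionError(CLOSED_MSG) from exc
    return asyncio.wrap_future(future)


async def ping():
    return "pong"


# Simulate a secondary loop whose thread has already gone away.
dead_loop = asyncio.new_event_loop()
dead_loop.close()

try:
    dispatch(dead_loop, ping())
    outcome = "dispatched"
except ConnectionError as exc:
    outcome = f"ConnectionError: {exc}"

print(outcome)
```

The point of the design is that the caller sees an exception type it can match on, rather than a `TypeError` raised later when it tries to await `None`.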

EZSP startup race over TCP

Reset frames from the NCP can arrive faster than wait_for_startup_reset() is reached when the transport is TCP. The first reset gets dropped, enter_failed_state() fires, and the integration never recovers without a manual restart.

  • bellows/uart.py::_connect: pre-create gateway._startup_reset_future before opening the connection so any reset frame received during setup is caught by reset_received().
  • bellows/uart.py::wait_for_startup_reset: replace assert is None with if is None so the pre-created future isn't rejected.
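
The race and the fix can be modelled in a few lines. `MiniGateway`, its `failed` flag, and the `"RESET_SOFTWARE"` label are hypothetical stand-ins used only for illustration; the real gateway does considerably more.

```python
import asyncio


class MiniGateway:
    """Hypothetical reduction of the gateway; only the reset path is modelled."""

    def __init__(self):
        self._startup_reset_future = None
        self.failed = False

    def reset_received(self, code):
        future = self._startup_reset_future
        if future is None:
            # Nobody is waiting, so the reset is treated as unexpected.
            self.failed = True
        elif not future.done():
            future.set_result(code)

    async def wait_for_startup_reset(self):
        # "if is None" instead of "assert is None": a pre-created future is kept.
        if self._startup_reset_future is None:
            self._startup_reset_future = asyncio.get_running_loop().create_future()
        try:
            return await self._startup_reset_future
        finally:
            self._startup_reset_future = None


async def connect(pre_create):
    gw = MiniGateway()
    if pre_create:
        gw._startup_reset_future = asyncio.get_running_loop().create_future()
    # The reset frame arrives before wait_for_startup_reset() is reached.
    gw.reset_received("RESET_SOFTWARE")
    if gw.failed:
        return "failed"
    return await asyncio.wait_for(gw.wait_for_startup_reset(), timeout=1)


without_fix = asyncio.run(connect(pre_create=False))
with_fix = asyncio.run(connect(pre_create=True))
print(without_fix, with_fix)
```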

TCP serial-bridge connection lifecycle

  • bellows/ash.py::eof_received: return True to suppress the transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHome stream_server / serial_proxy) sometimes signal EOF during initialization without intending to close the socket; the default auto-close orphans the connection.
  • bellows/ezsp/__init__.py::_startup_reset: raise a clear EzspError when self._gw is None.
  • bellows/ezsp/__init__.py::disconnect: tolerate _gw is None, and on ConnectionError from the gateway force-close the underlying TCP socket so ser2net/stream_server releases the port for subsequent attempts.
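
As a rough illustration of the null-gateway guard, the sketch below contrasts the unguarded and guarded paths. `MiniEzsp`, the local `EzspError` class, the error message, and the `reset()` method on the gateway are assumptions made for the example, not the bellows API.

```python
import asyncio


class EzspError(Exception):
    """Local stand-in for the bellows exception of the same name."""


class MiniEzsp:
    """Hypothetical reduction of the EZSP class with no gateway attached."""

    def __init__(self):
        self._gw = None

    async def unguarded_reset(self):
        # With _gw set to None this fails on attribute lookup.
        await self._gw.reset()

    async def guarded_reset(self):
        if self._gw is None:
            raise EzspError("Gateway is not connected")  # message is illustrative
        await self._gw.reset()


def outcome(coro):
    """Run a coroutine and report the exception type it raised, if any."""
    try:
        asyncio.run(coro)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


unguarded = outcome(MiniEzsp().unguarded_reset())
guarded = outcome(MiniEzsp().guarded_reset())
print(unguarded, guarded)
```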

Test plan

  • All 112 existing tests pass on Python 3.14.2.
  • tests/test_thread.py::test_proxy_loop_closed now asserts ConnectionError (was: silently returns).
  • tests/test_ezsp.py::test_startup_reset_gw_none covers the new null-gateway guard.
  • tests/test_ezsp.py::test_disconnect_gw_none covers tolerance of null gateway in disconnect().

End-to-end validation against real hardware

Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP) behind an ESPHome stream_server raw-TCP bridge on Python 3.14.2:

  1. bellows.ezsp.EZSP.connect() + startup_reset() complete cleanly. EZSP_VERSION = 13 negotiated. get_board_info() returns NCP identity (Nabu Casa, Connect ZBT).
  2. ZHA boots fresh, forms a network on channel 25, pairs five IAS-Zone water-leak sensors (Samjin model water).
  3. After OTA-ing the dongle to ESPHome serial_proxy (encrypted native API on :6053 only) and reconfiguring ZHA to esphome-hass://esphome/<entry_id>?port_name=..., the existing zigbee.db is reused via setup_strategy_advanced -> reuse_settings. All five sensors auto-rejoin without re-pair, attribute reports flow (battery, temperature, water-leak state).

Without this patch, step 1 fails immediately on Python 3.14 with 'NoneType' object can't be awaited (closed event loop) or AttributeError (get_event_loop no current loop), depending on which call path fires first. Step 3 also exercises the EOF-during-init suppression: the encrypted native-API tunnel issued an EOF during the noise-protocol handshake, and the eof_received -> return True change keeps the connection open.

Provenance

The bulk of the runtime fixes (and their tests) were originally drafted by @aautem in #711. Credited via Co-Authored-By trailer in the commit.

Five focused changes that together unbreak EZSP/EmberZNet on Python 3.14
and fix several long-standing rough edges with TCP-bridged serial radios
(ser2net, ESPHome stream_server / serial_proxy).

## Python 3.14 compatibility

- `bellows/thread.py`: `asyncio.iscoroutinefunction` is deprecated in 3.14
  and slated for removal; use `inspect.iscoroutinefunction` instead. Same
  semantics, no deprecation warning.
- `bellows/thread.py`, `bellows/uart.py`: replace four call sites of
  `asyncio.get_event_loop()` (deprecated outside a running loop) with
  `asyncio.get_running_loop()`. All four sites are reached from `async
  def` paths so they have a running loop.
- `bellows/cli/util.py`: `background()` decorator wrapped CLI commands
  with `asyncio.get_event_loop().run_until_complete(...)`, which raises
  `RuntimeError: no current event loop` on 3.14 from a fresh sync entry
  point. Replaced with `asyncio.run(...)` which is the documented modern
  equivalent.

## ThreadsafeProxy: surface closed-loop as ConnectionError

When the secondary event-loop thread is gone, `ThreadsafeProxy` used to
log a warning and return `None`, which propagated as
`TypeError: 'NoneType' object can't be awaited` at the call site —
opaque, and on Python 3.14 it left the integration in an indefinite
retry loop without recovery. Raise `ConnectionError` instead so callers
(EZSP startup_reset, retries) can handle the failure cleanly. Also
guards against the loop being closed between the `is_closed()` check
and `run_coroutine_threadsafe()` dispatch.

## EZSP startup race over TCP

Reset frames from the NCP can arrive on the wire faster than
`wait_for_startup_reset()` is reached when the underlying transport is
TCP. The first reset gets dropped, `enter_failed_state()` fires, and the
integration never recovers without a manual restart.

- `bellows/uart.py::_connect`: pre-create `gateway._startup_reset_future`
  before opening the connection so any reset frame received during
  setup is caught by `reset_received()` instead of being treated as
  spurious.
- `bellows/uart.py::wait_for_startup_reset`: replace the
  `assert is None` with `if is None` so the pre-created future isn't
  rejected.

## TCP serial-bridge connection lifecycle

- `bellows/ash.py::eof_received`: return `True` to suppress the
  transport's auto-close. Serial-over-TCP bridges (ser2net, ESPHome
  stream_server / serial_proxy) sometimes signal EOF during
  initialization handshakes without intending to close the socket; the
  default auto-close orphans the connection.
- `bellows/ezsp/__init__.py::_startup_reset`: raise a clear `EzspError`
  when `self._gw` is `None` instead of failing with an opaque
  `AttributeError`.
- `bellows/ezsp/__init__.py::disconnect`: tolerate `_gw is None`, and on
  `ConnectionError` from the gateway's disconnect (secondary loop dead)
  force-close the underlying TCP socket so ser2net/stream_server
  releases the port for subsequent attempts.

## Tests

- `tests/test_thread.py::test_proxy_loop_closed`: now asserts
  `ConnectionError` is raised (was: silently returns).
- `tests/test_ezsp.py::test_startup_reset_gw_none`: covers the new
  null-gateway guard.
- `tests/test_ezsp.py::test_disconnect_gw_none`: covers tolerance of
  null gateway in `disconnect()`.

## Validation

Patched bellows was vendored into a Home Assistant 2026.5.0b2 image and
exercised against a live Nabu Casa Connect ZBT-2 (EFR32MG24 Zigbee NCP)
behind an ESPHome `stream_server` raw-TCP bridge on Python 3.14.2:
- `bellows.ezsp.EZSP.connect()` + `startup_reset()` complete; protocol
  version 13 read; `get_board_info()` returns NCP identity.
- ZHA boots, forms a fresh network, pairs 5 IAS-Zone water-leak sensors.
- After OTA-ing the dongle from `stream_server` to ESPHome `serial_proxy`
  (encrypted native API), reconfiguring ZHA to
  `esphome-hass://esphome/<entry_id>?port_name=<port>` reuses the
  existing zigbee.db; all 5 sensors auto-rejoin without re-pair.

The full upstream test suite (112 tests) passes.

## Provenance

The bulk of the runtime fixes (and their tests) were originally drafted
by @aautem in zigpy#711 (closed by author with no review). This PR rebases
that work onto current `dev`, drops the diagnostic-only `LOGGER.warning`
calls that zigpy#711 carried alongside the fixes, restores the post-zigpy#714
`NETWORK_COORDINATOR_STARTUP_RESET_WAIT = 2`, and adds the
`bellows/cli/util.py` get_event_loop fix and the EZSP-over-TCP
end-to-end validation against real hardware.

Co-Authored-By: Alex Autem <autem.alex@gmail.com>
@codecov

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.54%. Comparing base (4b97a6d) to head (384975d).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #720   +/-   ##
=======================================
  Coverage   99.54%   99.54%           
=======================================
  Files          61       61           
  Lines        4147     4155    +8     
=======================================
+ Hits         4128     4136    +8     
  Misses         19       19           

silenthooligan pushed a commit to silenthooligan/code-sharing that referenced this pull request May 5, 2026
The ZBT-2 Zigbee YAML now ships ESPHome `serial_proxy` (encrypted
native API on :6053) as the default. ZHA reaches the radio via
  esphome-hass://esphome/<entry_id>?port_name=MG24%20Zigbee%20NCP
on HA 2026.5+. The previous `stream_server` raw-TCP variant moves to
zbt-2-zigbee/legacy-stream-server/ as the pre-2026.5 / pre-bellows-fix
fallback.

Status & dependencies are documented in zbt-2-zigbee/README.md and the
top-level README. Two upstream gates: HA >= 2026.5 (for the
`esphome-hass://` URL handler in homeassistant/components/esphome/
serial_proxy.py) and zigpy/bellows#720 (Python 3.14 + EZSP-over-TCP
runtime fixes; until merged, vendor patched bellows into the HA image).

Validated against HA 2026.5.0b2 + the patched bellows fork. EZSP
startup completes, ZHA pairs IAS-Zone end devices and routers,
attribute reports flow.

Migration walkthrough (existing stream_server -> serial_proxy without
re-pairing devices) is in zbt-2-zigbee/README.md. Key gotcha: ZHA
doesn't support reconfiguring radio path, must delete + recreate using
"Advanced -> Reuse settings" so the existing zigbee.db is honored.

Closes #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three focused tests for the new code paths introduced in this PR:

- `tests/test_thread.py::test_proxy_coroutine_loop_closed_mid_dispatch`
  exercises the `RuntimeError` catch around `run_coroutine_threadsafe()`
  when the loop is closed between the `is_closed()` check and dispatch:
  asserts ConnectionError is raised and the orphaned coroutine is closed
  (no un-awaited-coroutine warning).
- `tests/test_ezsp.py::test_disconnect_force_closes_socket_on_connection_error`
  exercises the `disconnect()` path that, on `ConnectionError` from the
  gateway, walks the proxy/transport chain to force-close the underlying
  TCP socket so ser2net / stream_server / serial_proxy releases the
  port.
- `tests/test_ezsp.py::test_disconnect_socket_force_close_swallows_exceptions`
  exercises the inner `except Exception: pass` fallback when the
  proxy/transport chain isn't fully populated (e.g. `_obj` has no
  `_transport` attribute), confirming `_gw` is still cleared.

Lifts patch coverage from 63.6% to 100% on `bellows/thread.py` and
brings the EZSP changes within the project's 99.25% gate.
@puddly
Contributor

puddly commented May 5, 2026

Can you attach some debug logs of the issue you're having and how this PR solves it? ZHA is used with a ton of TCP coordinators and we haven't really had reports of issues like this.

@silenthooligan
Author

silenthooligan commented May 5, 2026

Live logs against a Nabu Casa Connect ZBT-2 (EFR32MG24, ESPHome stream_server raw-TCP bridge on 192.168.40.16:6638) from inside a fresh python:3.14-slim container, comparing stock 0.49.1 vs this PR.

Full bellows DEBUG output for all three runs + the test scripts are in this gist:
https://gist.github.com/silenthooligan/6d1399b94ebad2716a2e0b39a1809c7f

Summary:

01: stock 0.49.1, first connect (no teardown)
EZSP startup completes cleanly, board info reads. The recent #714 startup-wait extension dodges the original startup race on first connect, so the bug doesn't reproduce here on a clean boot. It surfaces on reconnect cycles instead, see runs 02/03.

02: stock 0.49.1, secondary loop teardown then dispatch

>>> Phase 1: connect + startup
>>> connected, EZSP version 13
>>> Phase 2: closing the secondary loop
>>> secondary loop running=False closed=True
>>> Phase 3: dispatch a command via the proxy
WARNING [bellows.thread] Attempted to use a closed event loop
>>> dispatch FAILED: TypeError: 'NoneType' object can't be awaited
>>> Phase 4: clean disconnect
WARNING [bellows.thread] Attempted to use a closed event loop
>>> disconnect FAILED: TypeError: 'NoneType' object can't be awaited

Caller has no signal that the connection died (just a TypeError on None), so the integration's retry path can't tell the difference between a transient command failure and a dead transport. ZHA loops without ever rebuilding the gateway.

03: patched bellows (this PR), same teardown

>>> Phase 1: connect + startup
>>> connected, EZSP version 13
>>> Phase 2: closing the secondary loop
>>> secondary loop running=False closed=True
>>> Phase 3: dispatch a command via the proxy
>>> dispatch FAILED: ConnectionError: Attempted to use a closed event loop, the connection may have been lost
>>> Phase 4: clean disconnect
>>> disconnect OK

ConnectionError is raised on dispatch (catchable, signals "rebuild the connection"), and disconnect() succeeds because the PR's fallback force-closes the underlying TCP socket when the gateway can't be reached.

Three conditions for the bug to fire in production: closed secondary loop, caller mid-dispatch, Python 3.14's stricter await None. Stable single-boot ConBee/SkyConnect setups don't see it. Setups that cycle (HA restarts, host reboots, WiFi-bridged dongles flapping) do.

Other bits in this PR (uart.py::wait_for_startup_reset race pre-create, ash.py::eof_received -> True for serial-over-TCP bridges that signal EOF mid-handshake) are TCP-only and don't surface on the first-connect run because #714 already extended the startup wait. They show up on reconnect attempts against ESPHome serial_proxy (encrypted native API) where the noise handshake itself produces an EOF that auto-closes the socket on stock.

@puddly
Contributor

puddly commented May 5, 2026

Can you attach a log from the live device that you're communicating with?

@silenthooligan
Author

Live logs are in this gist (full bellows DEBUG for all three runs plus the test scripts):
https://gist.github.com/silenthooligan/6d1399b94ebad2716a2e0b39a1809c7f

Hardware: Nabu Casa Connect ZBT-2 (EFR32MG24) on an ESPHome stream_server raw-TCP bridge at 192.168.40.16:6638. Driver-side tests run from python:3.14-slim.

The race only fires when the secondary loop closes mid-dispatch (HA restart, reconnect cycle). Clean first-connect runs dodge it because #714's startup-wait extension already covers that window.

Stock 0.49.1, secondary loop closed then dispatch:

WARNING [bellows.thread] Attempted to use a closed event loop
>>> dispatch FAILED: TypeError: 'NoneType' object can't be awaited
>>> disconnect FAILED: TypeError: 'NoneType' object can't be awaited

This PR, same scenario:

>>> dispatch FAILED: ConnectionError: Attempted to use a closed event loop, the connection may have been lost
>>> disconnect OK

Caller now gets a catchable ConnectionError instead of a bare TypeError on None, and disconnect() succeeds because the PR's fallback force-closes the underlying TCP socket when the gateway can't be reached. ZHA's reconnect path can act on the former; with the latter it hangs.

Comment thread bellows/ezsp/__init__.py Outdated
try:
    ash = self._gw._obj._transport
    if ash is not None and ash._transport is not None:
        sock = ash._transport.get_extra_info("socket")
Contributor

transport.get_extra_info("socket") is not part of the latest version of zigpy or bellows, please make sure you're using the latest version of the packages. I don't think this is a good pattern.

…eview

The ConnectionError fallback in `EZSP.disconnect()` reached past zigpy's
transport abstraction by walking `_gw._obj._transport._transport` and
calling `get_extra_info("socket")` to force-close the underlying TCP
socket. Two problems with this:

1. Pattern bypasses the public abstraction by relying on private
   attributes that aren't part of the contract.
2. zigpy 1.4.0's serialx migration changed the transport hierarchy, so
   the walk hits the inner `except Exception: pass` and silently
   no-ops on current HA anyway.

Replace the inner force-close with a plain `pass`: drop the gateway
reference on ConnectionError and let the caller rebuild from scratch.
The motivating concern (TCP socket lingering half-open until OS timeout
on ser2net / ESPHome stream_server bridges) is real but should be
handled at the serialx transport layer or by the OS half-open timeout,
not by reaching past the abstraction.

Validated end-to-end with the same teardown cycle from the original
review: the patched `disconnect()` still returns `>>> disconnect OK`
because the simplified handler still swallows ConnectionError. The
force-close fallback was inert on the live-hardware run against the
ZBT-2 over `socket://...:6638`.

Test simplified accordingly: drop
`test_disconnect_force_closes_socket_on_connection_error` (asserted the
private-attribute walk worked) and merge the two remaining cases into
`test_disconnect_swallows_connection_error`.

Full suite: 428 pass on Python 3.14.2 (down from 429, the dropped test).
`bellows/thread.py` coverage stays 100%.

Comment thread bellows/uart.py Outdated
# Pre-create the startup reset future before opening the connection so that
# reset frames arriving immediately after connect are captured by
# reset_received() instead of triggering enter_failed_state().
gateway._startup_reset_future = loop.create_future()
Contributor

I think it would be simpler to have _startup_reset_future default to loop.create_future() within Gateway.

…view

Per puddly's review on uart.py:119, move the startup-reset-future
pre-creation out of `_connect()` and into `Gateway.__init__()` itself
so the future's lifecycle is owned by the class that uses it.

Gateway gets an optional `loop=None` keyword. When passed (from
`_connect()` which is in async context with a running loop), the future
is pre-created. When not passed (the path tests use, constructing
Gateway in sync context or as a plain unit-test fixture), it stays
None and `wait_for_startup_reset()` lazily creates it on first call.
That preserves the existing test contract (`assert gw._startup_reset_future is None`
still holds for bare Gateway construction) without giving up the
race-fix behavior on the real connect path.

Tests: 428 pass on Python 3.14.2, no regressions vs `7cf20c0`.
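
A minimal sketch of the constructor contract described in this commit, assuming a heavily reduced `Gateway` (the real class takes other arguments and handles far more state); the `"early reset"` value is a placeholder.

```python
import asyncio


class Gateway:
    """Hypothetical reduction; only the future's lifecycle is shown."""

    def __init__(self, loop=None):
        # Pre-create only when a loop is supplied (the real connect path).
        self._startup_reset_future = None if loop is None else loop.create_future()

    def reset_received(self, code):
        future = self._startup_reset_future
        if future is not None and not future.done():
            future.set_result(code)

    async def wait_for_startup_reset(self):
        if self._startup_reset_future is None:
            # Lazy creation keeps bare construction usable from sync tests.
            self._startup_reset_future = asyncio.get_running_loop().create_future()
        try:
            return await self._startup_reset_future
        finally:
            self._startup_reset_future = None


async def connect():
    gw = Gateway(loop=asyncio.get_running_loop())
    gw.reset_received("early reset")  # arrives before anyone waits
    return await gw.wait_for_startup_reset()


bare = Gateway()  # sync construction, as in unit tests
captured = asyncio.run(connect())
print(bare._startup_reset_future, captured)
```

Keeping `loop` optional means the class owns the future without forcing every construction site to have a running loop.
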
@silenthooligan
Copy link
Copy Markdown
Author

Apologies @puddly. Didn't review the diff carefully before opening. The day job has been demanding lately and I was swivel-chairing with this as a distraction trying to get the ZBT-2 WiFi transport (ESPHome stream_server / serial_proxy bridge) running on my own setup. Used Claude to iterate fixes against live hardware, which is how both rough edges you flagged ended up in the patch. My fault for not catching them on second pass.

Both addressed:

  1. bellows/ezsp/__init__.py force-close fallback. Dropped. You're right the walk into asyncio internals via private attrs isn't the way, and zigpy 1.4's serialx migration already made it a no-op for current HA. Simplified handler now just clears _gw on ConnectionError. Live cycle re-run shows Phase 4 still hits >>> disconnect OK, confirming the force-close was inert. New log added to the gist as 04-patched-cycle-simplified.log.
  2. bellows/uart.py _startup_reset_future pre-create. Moved into Gateway.__init__ per your suggestion. Optional loop=None keyword so existing tests that construct Gateway in non-async context keep the original None default; _connect() passes the loop in to trigger pre-create on the real path.

Tests: 428 pass on Python 3.14.2 across the two new commits, no regressions.

bellows itself was already current (PR base is dev HEAD 4b97a6df, no missed commits). zigpy 1.4 serialx is the moving piece I should have caught myself before submitting.

Comment thread bellows/ash.py Outdated
-    def eof_received(self):
+    def eof_received(self) -> bool:
         self._ezsp_protocol.eof_received()
         # Return True to prevent the transport from auto-closing. For
Contributor

How does the remote signal EOF without closing? The asyncio docs state machine shows:

start -> connection_made
    [-> data_received]*
    [-> eof_received]?
-> connection_lost -> end

And the docs state:

This method may return a false value (including None), in which case the transport will close itself. Conversely, if this method returns a true value, the protocol used determines whether to close the transport. Since the default implementation returns None, it implicitly closes the connection.

I don't think this change is right?
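
The documented behaviour can be observed with a small probe. This is only a sketch over a local `socket.socketpair()`, so it says nothing about how ser2net or ESPHome bridges behave; `EofProbe` and `probe` are hypothetical names.

```python
import asyncio
import socket


class EofProbe(asyncio.Protocol):
    """Records protocol callbacks; eof_received() returns `keep_open`."""

    def __init__(self, keep_open):
        self.keep_open = keep_open
        self.events = []

    def connection_made(self, transport):
        self.events.append("connection_made")

    def eof_received(self):
        self.events.append("eof_received")
        return self.keep_open  # falsy -> the transport closes itself

    def connection_lost(self, exc):
        self.events.append("connection_lost")


async def probe(keep_open):
    local, remote = socket.socketpair()
    loop = asyncio.get_running_loop()
    transport, proto = await loop.create_connection(
        lambda: EofProbe(keep_open), sock=local
    )
    # The remote side half-closes: it sends EOF but keeps its read side open.
    remote.shutdown(socket.SHUT_WR)
    await asyncio.sleep(0.1)  # give the loop time to deliver the EOF
    seen = list(proto.events)
    # Clean up regardless of which branch was taken.
    transport.close()
    remote.close()
    await asyncio.sleep(0)
    return seen


default_close = asyncio.run(probe(keep_open=False))
kept_open = asyncio.run(probe(keep_open=True))
print(default_close)
print(kept_open)
```

With a falsy return value the transport proceeds straight to `connection_lost`; with a true value the local write side stays usable, but the remote has still finished sending.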

Author

Fair point. I borrowed that hunk from #711's commit body (the closed-unmerged PR by @aautem that this one was rebased from) without validating its premise myself; it claimed an init-time spurious EOF on serial-over-TCP. You're right that the asyncio docs are clear, and honestly I don't have a reproduction of that scenario on my own setup either: stock 0.49.1 on Python 3.14.2 against the ZBT-2 over stream_server completes EZSP startup cleanly on first connect (log in the gist as 01-stock-probe.log).

Dropping it and deferring to you on this one. You know this code far better than I do, and I'd rather get it right than push something speculative. Door's open if a concrete EOF-during-init case surfaces later for some bridge config.

Reverted in f43c38f.

Per puddly's review: returning True from eof_received was carried over
from zigpy#711's commit body, which claimed an init-time spurious EOF on
serial-over-TCP that auto-close turned terminal. The asyncio docs are
explicit that EOF means the remote sent FIN; there is no
"EOF-without-intent-to-close" path. ASH/EZSP needs full duplex anyway,
so half-close keep-alive doesn't help us even when the asyncio pattern
is technically valid for other protocols.

The hunk wasn't reproducible against the actual ZBT-2 over stream_server
(stock 0.49.1 + Python 3.14.2 completes EZSP startup cleanly on first
connect, no spurious eof_received firing), so dropping it loses no
demonstrable functionality. If a concrete EOF-during-init case for
some bridge config surfaces later it can be addressed separately.

Tests: 428 pass on Python 3.14.2, no change in count.

ruff (the pinned v0.0.261 in pre-commit-config) flagged the simplified
disconnect handler from 7cf20c0 with SIM105 (try-except-pass should be
contextlib.suppress). Same behavior, idiomatic form. contextlib was
already imported.

Tests: 428 pass on Python 3.14.2, no behavioral change.
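
For illustration, the simplified handler in its `contextlib.suppress` form might look roughly like this. `MiniEzsp` and `DeadGateway` are hypothetical stand-ins, not the bellows classes.

```python
import asyncio
import contextlib


class DeadGateway:
    """Stand-in gateway whose secondary loop is already gone."""

    async def disconnect(self):
        raise ConnectionError("Attempted to use a closed event loop")


class MiniEzsp:
    """Hypothetical reduction of the EZSP class; only disconnect() is modelled."""

    def __init__(self, gateway=None):
        self._gw = gateway

    async def disconnect(self):
        if self._gw is None:
            return  # nothing to tear down
        # Same behaviour as try / except ConnectionError / pass (ruff SIM105).
        with contextlib.suppress(ConnectionError):
            await self._gw.disconnect()
        # Drop the reference either way so the caller rebuilds from scratch.
        self._gw = None


ezsp = MiniEzsp(DeadGateway())
asyncio.run(ezsp.disconnect())
asyncio.run(MiniEzsp().disconnect())  # null gateway is tolerated
print(ezsp._gw)
```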