Skip to content

feat(tts): add gradium tts support#2193

Open
BenWeekes wants to merge 6 commits into
mainfrom
feat/gradium-tts
Open

feat(tts): add gradium tts support#2193
BenWeekes wants to merge 6 commits into
mainfrom
feat/gradium-tts

Conversation

@BenWeekes

Copy link
Copy Markdown
Contributor

Summary

Adds the Gradium TTS extension (gradium_tts_python), a websocket streaming TTS built on AsyncTTS2BaseExtension, plus a voice_assistant_gradium example graph.

  • Vendor client (gradium_tts.py): websocket setup → readytext/end_of_streamaudio/error, with auth-error classification (401/403/1008 → fatal).
  • Extension (extension.py): streams audio through a single finalize path (_finalize_requestfinish_request); request-level TTFB metric (emitted once per request even across multiple vendor segments); audio framing prefers the server-ready sample rate, falling back to config.
  • Config (config.py): pydantic model with param normalization, output-format/sample-rate reconciliation, and api-key redaction in logs.
  • Example wiring: new voice_assistant_gradium predefined graph (deepgram ASR + openai LLM + gradium TTS), manifest dependency, regenerated lockfile.

Tests

Standalone suite mirrors the sibling TTS extensions (basic audio, flush, state machine, robustness, params, metrics, error handling), plus gradium-specific coverage for its text-batching and per-request socket model.

  • task test-extension EXTENSION=agents/ten_packages/extension/gradium_tts_python → 27 passed
  • task tts-guarder-test EXTENSION=gradium_tts_python CONFIG_DIR=tests/configs → 15 passed, 1 skipped

Follow-ups (non-blocking)

  • Add punctuation-only / list-transition robustness regressions to match the documented robustness matrix.
  • Add an explicit 48 kHz config/test if we advertise pcm_48000.
  • If a future refactor reuses one websocket across segments, revisit reconnect/backoff coverage.

Comment thread ai_agents/agents/ten_packages/extension/gradium_tts_python/config.py Outdated
Comment thread ai_agents/agents/ten_packages/extension/gradium_tts_python/config.py Outdated
@BenWeekes

Copy link
Copy Markdown
Contributor Author

TTS guarder + standalone test records (commit 931cc24)

Addressed both review comments on config.py:

  • sample_rate is the single source of truth — Gradium only supports PCM, so the wire output_format is now derived as pcm_<sample_rate> and is no longer a user-settable knob (removed the two conflicting normalize/derive helpers). This matches the closest peer minimax_tts_websocket and the convention across other TTS extensions (sample_rate is the canonical rate knob).
  • base_urlurl — the field holds the full websocket endpoint and is used as-is (no path appended in code), matching the url convention used by the modern WS extensions (minimax, vibevoice).

TTS guarder — task tts-guarder-test EXTENSION=gradium_tts_python

Run against the real Gradium API, from the branch worktree (container /app is a bind mount of this branch). 15 passed, 1 skipped in 363.94s.

Case Result
test_append_input ✅ passed
test_append_input_stress ✅ passed
test_append_input_without_text_input_end ✅ passed
test_append_interrupt ✅ passed
test_basic_audio_setting ✅ passed
test_corner_input ✅ passed
test_dump ✅ passed
test_dump_each_request_id ✅ passed
test_empty_text_request ✅ passed
test_flush ✅ passed
test_interleaved_requests ✅ passed
test_invalid_required_params ✅ passed
test_invalid_text_handling ✅ passed
test_metrics ✅ passed
test_miss_required_params ✅ passed
test_subtitle_alignment ⏭️ skipped (disabled by default; optional — only for providers that emit word/segment timing, which Gradium does not)

Standalone suite — task test-extension EXTENSION=...gradium_tts_python

28 passed (includes a new test_output_format_derived_from_sample_rate_only that locks in: a user-supplied output_format is ignored, sample_rate drives the derived pcm_<rate>, and it is not leaked as a vendor passthrough param).

Both runs were executed on the branch code; black format-check is clean.

@github-actions

Copy link
Copy Markdown

Code Review: Gradium TTS extension

Thanks for the thorough work here — the per-request socket model, the single finalize path, and the test suite mirroring the sibling extensions are all solid. A few things worth a look before merge.

Potential bugs

1. on_data double-queues every tts_text_input message (extension.py).
This is the most important one. The new on_data override appends each tts_text_input to self.ingress_messages, then calls super().on_data(...), which dispatches to request_tts(t) with that same parsed payload. Inside request_tts you then call self._pop_next_ingress_message(t.request_id, t), popping from the queue and processing the popped message instead of the one the base class handed you.

For a strictly FIFO base class these line up, but the design is fragile:

  • It assumes the base class delivers messages to request_tts in exactly the same order and count as on_data sees them. If the base class ever filters, coalesces, or reorders (e.g. drops a message after text_input_end), the queue and the dispatch stream drift permanently out of sync — every subsequent request processes a stale t.
  • _pop_next_ingress_message is also called once per request_tts, but on_data enqueues once per data event. If the base class ever invokes request_tts more than once for a payload (or zero times), the counts diverge.

What's the queue actually buying you over just using the t the base class passes in? If it's only to recover metadata/ordering, consider dropping the parallel queue and working directly off t. If it's genuinely needed, please add a comment explaining the invariant ("base class calls request_tts exactly once per tts_text_input, in order") since the correctness hinges entirely on it.

2. cancel_tts finalizes but request_contexts may already be gone.
cancel_tts sets context.finished = True then calls _finalize_request, which does del self.request_contexts[request_id]. If cancel_tts fires twice, or after a natural REQUEST_END finalize, the second _finalize_request returns early (good — context is None guard), but cancel_tts still calls self.client.cancel() and re-issues a send_tts_audio_end path only if context exists. Worth a quick test for double-cancel / cancel-after-completion to confirm no duplicate tts_audio_end.

3. Clean-close detection relies on getattr(exc, "code", None) == 1000 (gradium_tts.py).
You depend on websockets 14.x for additional_headers (good — the old extra_headers was removed). But the ConnectionClosedOK/ConnectionClosed exceptions in 14.x expose the close code via exc.rcvd.code / exc.sent.code, not a top-level exc.code attribute. The tests pass because they construct a fake exception with a literal .code = 1000, so this path is only ever exercised against the mock, never the real library exception. Please verify against a real ConnectionClosedOK from websockets 14 — if .code isn't present, a normal end-of-stream close (1000) will be reported as EVENT_TTS_ERROR instead of EVENT_TTS_END. The sibling ASR extensions parse the code out of str(ConnectionClosed) precisely because of this.

4. _wait_for_ready maps every recv failure to 401/500.
A timeout waiting for ready (server slow, network hiccup) becomes status_code=500 → NON_FATAL, which is fine, but a genuinely transient close gets classified via _closed_during_setup which treats codes {1000, 1008, 1011} as auth (401 → FATAL). A 1011 (server internal error) or even a clean 1000 during setup isn't necessarily an auth problem, and marking it FATAL stops retries. Consider narrowing the FATAL set to actual auth signals (1008 + the auth token list) and letting the rest stay NON_FATAL.

Code quality

  • Dead/duplicated state. The extension now keeps both per-request RequestContext and a parallel set of instance mirrors (self.current_request_finished, self.total_audio_bytes, self._audio_start_sent, self.current_request_sample_rate, self.sent_ts) that are continually copied back and forth via _apply_request_context and manual assignments. This is a lot of surface area for the two to disagree. Since synthesize_audio_sample_rate() is the only base-class hook that reads instance state mid-stream, consider keeping only the context dict and a single current_request_id, and deriving everything else from self.request_contexts[current_request_id].
  • _calculate_audio_duration_ms hardcodes bytes_per_sample = 2 and channels = 1 instead of synthesize_audio_sample_width() / synthesize_audio_channels() like the prior version. That's correct for Gradium's 16-bit mono PCM today, but a magic constant that silently breaks if the format ever changes. A short comment ("Gradium is always 16-bit mono PCM") would help.
  • _clear_local_request_state swallows flush errors with a bare except Exception: pass. At least a debug log would help diagnose dump-file issues.
  • README/manifest drift: the README documents params.url and a cLONiZ4hQ8VpQ4Sz default, which matches the new config — good. But the PR summary still references output_format/pcm_48000 follow-ups; just confirm the README's removal of output_format as a user knob is intentional and complete (it is, per config.py, since it's now derived).

Performance

  • The per-request connect → ready → text → end → disconnect cycle pays a full websocket handshake + ready round-trip on every flush segment. With the new text-batching (MIN_STREAM_FLUSH_CHARS=80) this is mitigated, but a long streamed LLM response that flushes several times will reconnect several times within one request, each adding handshake latency to TTFB for that segment. The TTFB metric is only emitted once per request, so this added latency on later segments is invisible in metrics. The follow-up note about reusing one socket across segments is the right call — worth prioritizing.

Security

  • API-key redaction in to_str now covers both self.api_key and params["api_key"] — nice catch, that's a real leak the old code had.
  • _connect falls back to status_code=401 whenever "401" in message or "403" in message. A server error body that merely contains the substring "401" (e.g. a request ID) would be misclassified as fatal auth. Low risk, but string-sniffing status codes is brittle; prefer the structured InvalidStatus.response.status_code path you already have and only fall back for the truly opaque case.

Tests

Good coverage of batching, interleaving, metrics-once, and the state machine. Gaps:

  • No test exercises the double-queue invariant in on_data (bug Update README.md #1) — e.g. interleaved messages where the base dispatch order could differ.
  • The clean-close test (Fix speech sdk #3) uses a hand-rolled exception with .code, so it doesn't protect against the real websockets 14 exception shape. Consider importing ConnectionClosedOK from websockets.exceptions and asserting your handler maps a real instance to EVENT_TTS_END.
  • No cancel-after-finalize / double-cancel test (modify playground default settings,fix voice type error #2).

Overall this is close. The on_data double-queue (#1) and the clean-close .code assumption (#3) are the two I'd want resolved or explicitly justified before merge; the rest are polish.

@github-actions

Copy link
Copy Markdown

Review: feat(tts): add gradium tts support

Solid, well-structured extension that closely mirrors the sibling TTS extensions. The per-request socket model, request-scoped RequestContext, single-emit TTFB, and the websockets >= 14 close-code handling are all done carefully, and the test suite is genuinely comprehensive. A few things worth a look before merge.

Potential bugs

  1. tts_flush clears all request state, not just the flushed request. In on_data, after delegating to the base class, _clear_local_request_state() wipes request_contexts, recorder_map, and ingress_messages wholesale. The whole PR is built around interleaved request_id handling (the on_data ordering override, per-request contexts). A flush that targets one turn discarding the buffered state of other in-flight requests seems to contradict that invariant. If flush is genuinely whole-pipeline this is fine, but worth confirming against base-class flush semantics and adding an interleaved-flush test.

  2. TTFB handler overwrites context.sent_ts. In _process_tts_text, the EVENT_TTS_TTFB_METRIC branch does context.sent_ts = datetime.now(). Since request_event_interval_ms in _finalize_request is computed as now - context.sent_ts, the reported interval is measured from first-audio rather than from request send. The TTFB value itself is correct (the client computes it from its own _sent_ts), but the interval semantics may not be what you intend. Worth a comment explaining the reset, or dropping it.

  3. current_request_sample_rate is instance-level under interleaving. synthesize_audio_sample_rate() returns the instance field, set per-chunk in _process_tts_text. Duration math correctly uses context.sample_rate, but anything the base class reads via synthesize_audio_sample_rate() can reflect a different request's rate when requests interleave. Probably benign given serialized request_tts, but flagging since the rest of the design is rigorously per-context.

Code quality

  • Auth classification by substring matching is brittle. _looks_like_auth_error / _is_auth_error scan for \"401\", \"403\", \"1008\", \"unauthorized\", etc. in free-text error bodies. A non-auth error whose message happens to contain 401 would be promoted to FATAL_ERROR (or vice-versa). The InvalidStatus.response.status_code path is the reliable signal; the string heuristic is a reasonable fallback but worth a comment noting its best-effort nature.
  • json_config type mismatch. The field is typed dict[str, Any] | str | None and forwarded raw in _send_setup, but manifest.json declares json_config as \"type\": \"string\". Either accept only the string form or document that a dict is also valid; right now schema and model disagree.
  • _disconnect swallows all errors silently (except Exception: pass). Fine for close, but a log_debug would help diagnose stuck-socket cases.
  • The _close_code helper and the on_data ordering override are both well-documented — nice.

Security

  • API-key redaction in to_str now covers both the top-level field and params[\"api_key\"] — good. Other params values are still logged verbatim; fine today, but if Gradium adds another secret-bearing param it won't be masked.
  • Dump files are never cleaned up. dump_path defaults to /tmp and _setup_recorder does os.makedirs(..., exist_ok=True); synthesized audio accumulates with no retention bound. dump defaults False, so low risk, but worth a README note that dumps are persistent and operator-managed.

Tests

Good coverage across basic audio, flush, state machine, params, metrics, and error handling, plus the gradium-specific interleaving and websockets-14 close-code regression. Gaps to consider:

  • An interleaved-flush test exercising concern Update README.md #1.
  • A test asserting request_event_interval_ms semantics (concern modify playground default settings,fix voice type error #2), so a future refactor doesn't silently change it.
  • Confirm the websockets >= 14 lower bound is pinned in the extension's Python requirements — the code now depends on additional_headers, websockets.asyncio.client.ClientConnection, and State.OPEN, none of which exist on older releases. A stale environment would fail at import/connect rather than gracefully.

The follow-ups you listed (punctuation/list robustness regressions, explicit 48 kHz coverage) are reasonable to defer. Nice work overall.

@BenWeekes

Copy link
Copy Markdown
Contributor Author

Update — commits 3bb3e909 and b2c1b819

On the on_data override (re: "the base class already implements on_data, so it shouldn't be needed")

I looked into removing the override and it turns out it's load-bearing — removing it drops text on interleaved requests. Keeping it, with a comment now explaining why.

Mechanism. Gradium is the only TTS extension that batches multiple same-request_id segments into a single vendor request (it accumulates pending_text and flushes on a sentence boundary or text_input_end, then deletes the per-request context on finalize). The base class buffers messages for a not-yet-active request_id and, in finish_request, re-queues them to the tail of its input queue. For interleaved requests this can hand request_tts() a request's text_input_end segment before an earlier segment of the same request — which flushes and deletes the context early, dropping the earlier text.

ingress_messages is a per-request_id FIFO captured at on_data arrival time and replayed in order in request_tts(), which neutralises that reordering.

Evidence. I empirically removed the override and ran the standalone suite — test_interleaved_requests_keep_separate_buffers fails, dropping interleave_req_2's first segment:

At index 1: ('interleave_req_2', 'It also finishes in one vendor request.')
  != ('interleave_req_2', 'Request two starts now and stays separate. It also finishes...')

Why the other extensions don't need this: I checked minimax, cartesia, elevenlabs, and xai_tts — they all send each segment to the vendor immediately inside request_tts, with no per-request text buffer, so an out-of-order text_input_end can't drop earlier text. Gradium's per-request-socket model makes tiny per-fragment requests expensive, which is why it batches — and why it alone is exposed.

b2c1b819 adds a thorough comment on on_data/ingress_messages documenting this invariant and pointing at the covering test.

Note: the cleaner long-term fix is in the base class — if finish_request preserved per-request arrival order when re-releasing buffered messages (rather than re-queuing to the tail), gradium could drop ingress_messages entirely and every extension would be safe. Happy to file that as a separate issue against ten_ai_base.

websockets 14 close-code (3bb3e909)

Separately fixed a latent bug: websockets~=14.0 exposes the close code on exc.rcvd/exc.sent (each a Close frame with .code), not a top-level exc.code. The clean-close path in _iter_messages relied on getattr(exc, "code", ...), so a real 1000 end-of-stream close would be misclassified as EVENT_TTS_ERROR instead of EVENT_TTS_END (the existing test passed only because it hand-built an exception with a top-level .code). Added a _close_code helper that reads rcvd/sent first and falls back to .code, plus a regression test using the real websockets-14 exception shape.

Standalone suite: 29 passed (28 + the new regression test); black clean.

@YiminW

YiminW commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

I don't think Gradium should batch all text segments with the same request_id before sending them to the vendor.

A few reasons:

  1. This is not how the other TTS extensions behave. Cartesia / ElevenLabs / xAI all forward each incoming TTSTextInput segment to the vendor/client layer as it arrives, and use text_input_end only to finalize the request. Gradium is currently different: it accumulates pending_text and only sends when the buffer reaches a sentence/size threshold or text_input_end.

  2. I don't think the AsyncTTS2BaseExtension ordering issue described here is accurate. tts2.py buffers messages from other request_ids while one request is being processed, and finish_request() releases the next buffered request back to the input queue in FIFO order. It serializes interleaved requests, but it should not reorder messages within the same request_id such that a text_input_end=True segment is processed before earlier segments of that same request. If we have a concrete repro for that, it should probably be fixed/tested in AsyncTTS2BaseExtension, not worked around by overriding on_data() in the Gradium extension.

  3. Gradium's official streaming guidance does not recommend waiting for the whole logical request before sending text. Their LLM-to-TTS recipe says to open one TTS WebSocket before the LLM starts, buffer tokens into word/phrase chunks, send each chunk as a text message, add <flush> at sentence/end boundaries, and read audio concurrently:
    https://docs.gradium.ai/guides/recipes/llm-to-tts.md

    The documented chunking rules are:

    • send complete words, phrases, or sentences
    • keep punctuation attached to the preceding word
    • avoid one-token-per-message unless tokens are already clean word chunks
    • add <flush> at natural boundaries
    • avoid frequent flushes because they can make prosody choppy

    In our pipeline, the LLM already guarantees that every emitted segment ends with punctuation, so the incoming segments should already satisfy Gradium's recommended chunking constraints.

@github-actions

Copy link
Copy Markdown

Review: feat(tts): add gradium tts support

Solid extension. It follows the sibling-TTS pattern, the per-request streaming session model is clean, and test coverage is good. A few issues worth addressing before merge.


Bugs / correctness

  1. json_config schema/type mismatch (likely runtime failure). In config.py, json_config is typed dict[str, Any] | str | None, and test_params.py passes a dict ({"speed": 1.1}). But manifest.json declares the param as "type": "string". The TEN property-schema validator will reject a dict value for a string-typed property when the graph passes an object, so a real graph using an object json_config may fail validation even though the unit test (which builds the config directly, bypassing the manifest) passes. Either declare it as an object in the manifest or constrain the Python type to str.

  2. Example graph still uses removed base_url key. property.json (the new voice_assistant_gradium graph) sets params.base_url: "wss://api.gradium.ai/...", but the config dropped base_url in favor of url. update_params() no longer extracts base_url, so it silently falls into vendor params and gets sent to the server in the setup payload as an unexpected base_url field, while the actual URL falls back to the default. The endpoint happens to match the default so it "works", but the stray key is sent to the vendor and a non-default region URL would be silently ignored. Rename to url.

  3. request_tts resets reader state on every new request_id without ensuring the prior reader finished. When request_id changes, _reset_request_state() runs but the previous _reader_task is not cancelled/awaited here (only cancel_tts/on_stop cancel it). If a new request arrives before the prior stream's reader finishes, the orphaned task can still call send_tts_audio_*/_finalize_request. _finalize_request guards on request_id != self.current_request_id (good), but the orphaned reader can still write to a stale recorder and emit audio frames against the old request. Consider awaiting _cancel_reader_task() on request transition.


Robustness

  1. audio_events() / _iter_messages has no idle timeout cap across a full request. WS_RECV_TIMEOUT=10s is per-recv. A vendor that keeps the socket open and dribbles a byte every 9s would never time out the request. The base class flush path covers interrupts, but a hung-but-alive socket has no overall ceiling. Minor, but worth noting given the follow-ups already mention reconnect/backoff.

  2. _disconnect() swallows all close exceptions silently (except Exception: pass). Fine for cleanup, but a debug log would help diagnose stuck-close situations.


Style / minor

  1. extension.py imports asyncio and uses a background reader task — good — but _cancel_reader_task swallows all exceptions with a bare except Exception: pass. At least log at debug so a crashing reader isn't invisible.

  2. config._ensure_dict silently converts any non-dict params (including a populated list/str) to {}, discarding user input without warning. A log_warn on the discard path would save debugging time.

  3. Auth-error detection is duplicated in two places: extension._is_auth_error (checks 401/403/unauthorized/forbidden) and gradium_tts._looks_like_auth_error (a superset). Consider consolidating to one source of truth to avoid divergence.


Test coverage

Coverage is strong — basic audio, dump, flush, segmented sessions, metrics (incl. once-per-request TTFB), params passthrough, sample-rate derivation, robustness (empty/whitespace/punctuation/long/special), and auth-error classification. Two gaps:

  • No test asserts the base_urlurl migration, which is exactly why issue modify playground default settings,fix voice type error #2 slipped through. A test that loads the example graph property or asserts base_url is rejected/ignored would catch it.
  • The mock client (gradium_mocks.py) models the streaming API faithfully but never exercises the real _close_code / _closed_during_setup logic for websockets >= 14 close frames. Given the careful version-compat handling there, a small unit test against a synthetic exception carrying rcvd.code would lock that behavior in.

Overall this is close. Issues #1 and #2 are the ones I'd want resolved before merge since they affect real graph configs, not just internals.

@BenWeekes

Copy link
Copy Markdown
Contributor Author

Update — stream segments immediately (commit 89da4f70)

This reworks the extension to forward each tts_text_input segment to Gradium as it arrives, instead of accumulating segments in a local buffer. It resolves the earlier review feedback about overriding on_data, and fixes the dropped-audio behaviour seen in manual testing.

What changed

  • gradium_tts.py — the client is now a persistent per-request streaming session: start_session() (connect + setup + ready) → send_text(segment) for each segment → end_input() (end_of_stream) → audio read concurrently via audio_events(). (Previously it opened one websocket per get() call.)
  • extension.pyrequest_tts() forwards each segment immediately over a single session for the request; a background reader streams audio back and finalizes on stream end. Removed the pending_text batching, _should_flush_pending_text, the on_data override, ingress_messages, and the per-request RequestContext dict.
  • Retains the websockets-14 close-code fix (exc.rcvd/exc.sent vs top-level exc.code).

Why

  • Matches the other websocket TTS extensions. Cartesia, ElevenLabs, xAI and Minimax all forward each segment to the vendor immediately and use text_input_end only to finalize. Gradium was the only one batching, which is also what forced the non-standard on_data override.
  • Matches Gradium's official LLM-to-TTS guidance (open one socket, send each chunk as a text message, read audio concurrently). Our LLM already emits punctuation-terminated segments, so each incoming segment is a valid chunk to send as-is.
  • Fixes dropped audio. Holding text in a local buffer that a finalize could flush-and-discard was dropping sections of speech. With immediate forwarding there is no buffer to lose, and the base class owns queuing/ordering — so the on_data override is no longer needed.

Verification

  • Manual: voice_assistant_gradium graph — clean multi-turn conversation; logs show each request completing with reason=REQUEST_END, one TTFB metric per request, no errors/warnings, api_key redacted.
  • TTS guarder (real Gradium API): 15 passed, 1 skipped (incl. test_interleaved_requests, test_flush, test_append_interrupt, test_corner_input, test_metrics).
  • Standalone suite: 29 passed (rewritten to the streaming API; adds immediate-forwarding, single-session-per-request, and punctuation-only coverage).
  • Formatting: task check (black, line-length 80) clean.

Net change: roughly −260 lines; the extension is now simpler and consistent with the rest of the TTS extensions.

Gradium previously accumulated same-request_id segments into a local buffer
and only flushed on a sentence/size threshold or text_input_end. Combined
with the base class re-queuing interleaved messages, this could drop sections
of TTS audio and required a non-standard on_data override.

Forward each segment to the vendor as it arrives over one persistent
per-request websocket session (start_session/send_text/end_input plus a
concurrent audio reader), matching Gradium's LLM-to-TTS streaming guidance
and the other websocket TTS extensions. The base class now owns
queuing/ordering, so the on_data override and ingress_messages are removed.

Also in this change:
- parse json_config (a JSON string per the manifest schema) into an object
  before sending it on the wire, so manifest, config and README agree.
- cancel the background reader task on a send/setup failure so it cannot
  outlive the request.
- read the websockets-14 close code from exc.rcvd/exc.sent (it is no longer
  a top-level exc.code) so a clean 1000 close is treated as end-of-stream.

Tests rewritten to the streaming API, with added coverage for immediate
forwarding, single-session-per-request, json_config parsing, reader-task
cancellation, and punctuation-only input.
@github-actions

Copy link
Copy Markdown

Review: feat(tts): add gradium tts support (#2193)

Solid extension. The refactor to a single-session-per-request streaming model with a background reader task, the once-per-request TTFB metric, and the websockets >= 14 close-code handling are all well done, and the test suite is genuinely thorough (state machine, metrics, params, robustness, error paths). A few things worth a look before merge.

Potential bugs / correctness

  1. Background reader task is not cancelled when a new request_id arrives mid-stream. request_tts resets state via _reset_request_state() when t.request_id != self.current_request_id, but that method does not touch self._reader_task. The only places the reader is cancelled are cancel_tts and on_stop. If a new request begins before the prior request's vendor stream has closed (reader still blocked on recv()), start_session() will _disconnect()/_connect() and replace self.ws, then a second reader task is created and assigned to self._reader_task — the orphaned first task is still iterating audio_events() -> self.ws.recv(), so two coroutines read the same new socket and can steal each other's frames. Recommend await self._cancel_reader_task() at the top of the new-request branch (before _reset_request_state). Reachable for back-to-back requests that arrive without an intervening flush.

  2. WS_RECV_TIMEOUT = 10s between chunks emits a hard error. In _iter_messages, an asyncio.TimeoutError yields EVENT_TTS_ERROR and finalizes the request. For long synthesis where the vendor pauses >10s between audio frames, this surfaces a spurious TTS error. If the vendor can legitimately go quiet that long, consider distinguishing "no data yet but socket open" from a true stall, or making the timeout configurable.

  3. assert self.ws is not None in _iter_messages / _send_json. Asserts are stripped under python -O, and a teardown race that nulls self.ws would become an AttributeError rather than a clean handled path. The cancel ordering mitigates this today, but an explicit if self.ws is None: return is more robust than an assert.

Style / maintainability

  1. Hardcoded bytes_per_sample = 2 / channels = 1 in _calculate_audio_duration_ms. Correct for Gradium's 16-bit mono PCM, but it diverges from the base helpers (synthesize_audio_sample_width() / synthesize_audio_channels()) the prior version used. The comment helps; consider deriving from the base config so duration math can't silently drift.

  2. README/manifest sync. Config table (params.url, retry_for_s, pronunciation_id) matches the manifest schema today — just keep them in lockstep on future edits.

Security

  • API key handling is correct: sent only via the x-api-key header, popped out of params in update_params so it can't leak into the setup passthrough payload (verified by test_params_passthrough), and redacted in to_str for both the top-level field and params.
  • User-configurable websocket url is expected for a vendor endpoint; no concern.

Test coverage

Strong overall. One gap maps to bug #1: no test exercises a new request arriving while the previous reader is still streaming. Worth adding given the concurrency model. The punctuation-only/list-transition robustness and explicit 48 kHz config are already noted as non-blocking follow-ups — agreed.

Nice work. Nothing here is a hard blocker except possibly #1, depending on how the base class serializes requests; the rest is polish.

Capture the gates that were easy to miss this round: the commitlint
body-max-line-length rule (there is no local commit-msg hook, so CI is the
first thing to catch a long body line), and a single pre-push checklist
covering black, lint, standalone tests, guarder tests and commit messages.

Also note the stale .ten/ gotcha: a leftover standalone install makes
task check report spurious reformatting and breaks the next install.
@github-actions

Copy link
Copy Markdown

Review: feat(tts): add gradium tts support

Solid, well-structured extension. It uses the correct websockets 14 API (additional_headers, ClientConnection.state == State.OPEN, close code read from exc.rcvd/exc.sent), redacts the API key in both the top-level field and params, emits TTFB once per request across multiple segments, and ships a genuinely good test suite — including explicit websockets-14 close-frame regression tests and connection-error classification. Nice work.

A few things worth considering before merge.

Potential issues

1. WS_RECV_TIMEOUT (10s) can conflict with retry_for_s.
_iter_messages aborts with an error if any single recv() exceeds 10s:

raw_msg = await asyncio.wait_for(self.ws.recv(), timeout=WS_RECV_TIMEOUT)

But the config exposes retry_for_s, which lets the server hold the connection while retrying. If a user sets retry_for_s above ~10s, the client-side recv timeout tears down a connection the server still considers healthy and surfaces a spurious EVENT_TTS_ERROR ("Timeout waiting for Gradium audio"). Consider deriving the recv timeout from retry_for_s (plus margin) when set, or documenting the ceiling.

2. Passthrough params can clobber reserved setup fields.
In _send_setup, the vendor-extras loop runs after the reserved keys are set:

if self.config.voice_id:
    payload["voice_id"] = self.config.voice_id
...
for key, value in self.config.params.items():
    payload[key] = value

A passthrough param named type, voice_id, voice, or json_config would silently overwrite the framework-set value (e.g. type: "setup"). Recommend skipping keys already present in payload, or keeping a reserved-key denylist, so a stray param can't break the setup envelope.

3. Possible orphaned reader task on request switch without text_input_end.
request_tts starts a reader task on first text, but on a new request_id it calls _reset_request_state() (not _cancel_reader_task()) before creating a new task:

self._reader_task = asyncio.create_task(self._read_audio(t.request_id))

In the normal flow the prior request ends via end_input and its reader has finished. But if a new request_id arrives while a prior reader is still live (no text_input_end seen), the old task reference is overwritten without being cancelled or awaited — an orphan. And since _finalize_request guards on request_id != self.current_request_id and returns early, that orphan's eventual finalize is dropped silently. cancel_tts/flush handle this correctly; the gap is the request-switch path. Cheap fix: await self._cancel_reader_task() at the top of the new-request branch.

Minor

  • _calculate_audio_duration_ms hardcodes bytes_per_sample = 2 / channels = 1 instead of using synthesize_audio_sample_width() / synthesize_audio_channels(). Correct for Gradium today (16-bit mono PCM) and the inline comment says so — just noting the divergence from the base-class helpers.
  • README advertises sample_rate ∈ {8000,16000,22050,24000,44100,48000} and the PR lists a follow-up to add an explicit 48 kHz config/test — worth doing since it's sent on the wire as pcm_48000.

Test coverage

Strong. Covers basic audio, flush, dump byte-for-byte comparison, segmented/single-TTFB metrics, params passthrough + json_config parsing, sample-rate variants, empty/whitespace/long/special/punctuation-only text, sequential requests, reconnect-after-error, clean-close → END (both legacy .code and websockets-14 .rcvd/.sent shapes), 1008 → fatal auth mapping, and reader-task cancellation on send failure. The two gaps mapping to the issues above would be good additions: a recv-timeout-vs-retry_for_s case, and a request-switch-without-text_input_end case asserting no orphaned task.

Overall this is in good shape — items 1–3 are the ones I'd want addressed or consciously waived before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants