Skip to content

fix(buzz-acp): clear stale 👀 on events delivered via native steer#1498

Open
tlongwell-block wants to merge 3 commits into
mainfrom
fix/steer-stale-eyes
Open

fix(buzz-acp): clear stale 👀 on events delivered via native steer#1498
tlongwell-block wants to merge 3 commits into
mainfrom
fix/steer-stale-eyes

Conversation

@tlongwell-block

@tlongwell-block tlongwell-block commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Problem

Since goose-native steering landed (#1160), messages absorbed into an in-flight turn via SteerAck::Success keep their 👀 reaction forever. Reported by Jem in #buzz-bugs: switched channels mid-turn, came back, eyes still on the message with no sign of activity.

Root cause

The 👀 is added fire-and-forget at queue-push time. Cleanup is owned by run_prompt_task's ReactionGuard, which collects event ids from the FlushBatch. But a successfully steered event is consumed by queue.remove_event() in the SteerAck::Success arm and never enters any batch — so no guard ever owns its id, and the 👀 is never removed. Every successfully steered message leaks its eyes.

Fix

Contract: eyes go away when the agent turn stops. One cleanup helper (spawn_steered_eyes_cleanup) is shared by every path.

  • TaskMeta.steered_event_ids: the SteerAck::Success arm attaches the consumed event id to the in-flight turn's meta.
  • Exact-turn binding (review, Wren): pool.send_steer returns the turn's tokio::task::Id; the ack watcher carries it, and record_steered_event matches on the exact task id — a delayed ack for an ended turn can never bind to a successor turn on the same channel. Stale id → immediate 👀 removal instead.
  • Turn end — normal completion (handle_prompt_result) or panic (recover_panicked_agent) — drains the ids and removes the 👀.
  • Graceful shutdown (review, Wren): the post-loop drain bypasses handle_prompt_result, so tokio_main now drains all steered ids from every in-flight TaskMeta (drain_all_steered_event_ids) as the loop exits, and — after the grace period settles every prompt task — drops its steer_ack_tx and drains steer_ack_rx (drain_crossing_steer_acks) to recover Success acks whose watchers resolved after the loop stopped polling. Both cleanups are bound-awaited before exit; a 2 s per-recv timeout backstops a wedged watcher.
  • Known limitation (pre-existing, documented in fn docs): non-Success acks crossing shutdown and queued-but-never-dispatched events share the same stale-👀-at-exit fate; relay-side cleanup owns that class.

Testing

  • pool::tests: exact-id binding, false-when-gone, the generation-reuse race (turn A ends → turn B same channel → A's late ack must not bind to B), shutdown drain take-once semantics.
  • steered_eyes_lifecycle_tests: wire-level assertions against a relay stub that answers the reaction /query and captures the signed kind:5 (NIP-09) deletion posted to /events — normal completion, panic recovery, the exact shutdown composition, late-ack immediate cleanup, crossing-ack drain (Success kept, non-Success ignored), wedged-watcher timeout (paused time), and the empty no-op path.
  • Full cargo test -p buzz-acp: 416 passed, 0 failed. Clippy --all-targets clean, fmt clean.

npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 3 commits July 3, 2026 10:27
Events absorbed into an in-flight turn via the goose-native steer path
(SteerAck::Success) are consumed with queue.remove_event() and never
enter a FlushBatch — so run_prompt_task's ReactionGuard never owns
their event ids, and the 👀 added at queue-push time was never removed.
Every successfully steered message kept its eyes forever.

Fix: record steered event ids on the in-flight turn's TaskMeta
(pool.record_steered_event). When the turn ends — normal completion
(handle_prompt_result) or panic (recover_panicked_agent) — the drained
ids get a best-effort 👀 removal, matching the user-visible contract
that the eyes go away when the agent turn stops. If the ack races past
turn completion (no task in flight), the SteerAck arm cleans up
immediately instead.

Fixes the stuck-👀 report in #buzz-bugs (eyes persist after a steered
turn finishes).

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Review (Wren) found two correctness gaps in the steered-reaction fix:

1. Generation-reuse race: SteerAckEvent carried only (channel_id,
   event_id), and record_steered_event matched TaskMeta by channel. If
   turn A completed and turn B started on the same channel before A's
   delayed SteerAck::Success arrived, A's steered 👀 attached to B and
   survived until the wrong turn ended. send_steer now returns the
   turn's tokio::task::Id; the ack watcher carries it, and
   record_steered_event matches on the exact task id — a stale id
   refuses to bind (returns false) and the ack arm removes the 👀
   immediately.

2. Graceful shutdown bypassed cleanup: the post-loop drain consumes
   PromptResults directly (never via handle_prompt_result) and aborts
   stragglers, so steered ids held in TaskMeta died with the pool and
   their 👀 went stale. tokio_main now drains all steered ids
   (pool.drain_all_steered_event_ids) as the main loop exits and
   bound-awaits a best-effort removal task before process exit.

The three duplicated cleanup blocks are unified in
spawn_steered_eyes_cleanup. New lifecycle tests observe actual wire
behavior against a relay stub — the kind:5 (NIP-09) deletion e-tagging
the 👀 reaction — for normal completion, panic recovery, the shutdown
drain composition, and late-ack immediate cleanup; the generation race
is pinned at the binding level in pool::tests.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Review (Wren) found a final shutdown race in 4719362: the one-time
TaskMeta drain only captures steered ids whose SteerAck::Success was
already processed by the main loop. A watcher can resolve during the
30s grace period — after select! stopped polling steer_ack_rx — so a
successful pending steer in that window was attached to nothing and
its 👀 died stale at process exit.

Fix: after the grace-period drain settles every prompt task (so every
ack watcher has resolved or dropped), shutdown drops its steer_ack_tx
and drains the channel (drain_crossing_steer_acks), collecting Success
event ids and removing their 👀 alongside the TaskMeta drain's ids —
both bound-awaited before exit. A per-recv 2s timeout backstops a
wedged watcher. Non-Success acks are dropped: their events were
released to the queue being discarded, the same pre-existing stale-👀
fate as any queued-but-never-dispatched event at exit.

Tests: crossing_success_ack_is_drained_and_cleaned_on_the_wire pins
the drain (Success kept, non-Success ignored, channel-close exit) and
the wire-level kind:5 deletion; crossing_ack_drain_times_out_on_wedged_watcher
(paused time) pins the no-hang backstop.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant