fix(buzz-acp): clear stale 👀 on events delivered via native steer#1498
Open
tlongwell-block wants to merge 3 commits into
Open
fix(buzz-acp): clear stale 👀 on events delivered via native steer#1498tlongwell-block wants to merge 3 commits into
tlongwell-block wants to merge 3 commits into
Conversation
Events absorbed into an in-flight turn via the goose-native steer path (SteerAck::Success) are consumed with queue.remove_event() and never enter a FlushBatch — so run_prompt_task's ReactionGuard never owns their event ids, and the 👀 added at queue-push time was never removed. Every successfully steered message kept its eyes forever. Fix: record steered event ids on the in-flight turn's TaskMeta (pool.record_steered_event). When the turn ends — normal completion (handle_prompt_result) or panic (recover_panicked_agent) — the drained ids get a best-effort 👀 removal, matching the user-visible contract that the eyes go away when the agent turn stops. If the ack races past turn completion (no task in flight), the SteerAck arm cleans up immediately instead. Fixes the stuck-👀 report in #buzz-bugs (eyes persist after a steered turn finishes). Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Review (Wren) found two correctness gaps in the steered-reaction fix: 1. Generation-reuse race: SteerAckEvent carried only (channel_id, event_id), and record_steered_event matched TaskMeta by channel. If turn A completed and turn B started on the same channel before A's delayed SteerAck::Success arrived, A's steered 👀 attached to B and survived until the wrong turn ended. send_steer now returns the turn's tokio::task::Id; the ack watcher carries it, and record_steered_event matches on the exact task id — a stale id refuses to bind (returns false) and the ack arm removes the 👀 immediately. 2. Graceful shutdown bypassed cleanup: the post-loop drain consumes PromptResults directly (never via handle_prompt_result) and aborts stragglers, so steered ids held in TaskMeta died with the pool and their 👀 went stale. tokio_main now drains all steered ids (pool.drain_all_steered_event_ids) as the main loop exits and bound-awaits a best-effort removal task before process exit. The three duplicated cleanup blocks are unified in spawn_steered_eyes_cleanup. New lifecycle tests observe actual wire behavior against a relay stub — the kind:5 (NIP-09) deletion e-tagging the 👀 reaction — for normal completion, panic recovery, the shutdown drain composition, and late-ack immediate cleanup; the generation race is pinned at the binding level in pool::tests. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
Review (Wren) found a final shutdown race in 4719362: the one-time TaskMeta drain only captures steered ids whose SteerAck::Success was already processed by the main loop. A watcher can resolve during the 30s grace period — after select! stopped polling steer_ack_rx — so a successful pending steer in that window was attached to nothing and its 👀 died stale at process exit. Fix: after the grace-period drain settles every prompt task (so every ack watcher has resolved or dropped), shutdown drops its steer_ack_tx and drains the channel (drain_crossing_steer_acks), collecting Success event ids and removing their 👀 alongside the TaskMeta drain's ids — both bound-awaited before exit. A per-recv 2s timeout backstops a wedged watcher. Non-Success acks are dropped: their events were released to the queue being discarded, the same pre-existing stale-👀 fate as any queued-but-never-dispatched event at exit. Tests: crossing_success_ack_is_drained_and_cleaned_on_the_wire pins the drain (Success kept, non-Success ignored, channel-close exit) and the wire-level kind:5 deletion; crossing_ack_drain_times_out_on_wedged_watcher (paused time) pins the no-hang backstop. Co-authored-by: Tyler Longwell <tlongwell@block.xyz> Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Since goose-native steering landed (#1160), messages absorbed into an in-flight turn via
SteerAck::Successkeep their 👀 reaction forever. Reported by Jem in #buzz-bugs: switched channels mid-turn, came back, eyes still on the message with no sign of activity.Root cause
The 👀 is added fire-and-forget at queue-push time. Cleanup is owned by
run_prompt_task'sReactionGuard, which collects event ids from the FlushBatch. But a successfully steered event is consumed byqueue.remove_event()in theSteerAck::Successarm and never enters any batch — so no guard ever owns its id, and the 👀 is never removed. Every successfully steered message leaks its eyes.Fix
Contract: eyes go away when the agent turn stops. One cleanup helper (
spawn_steered_eyes_cleanup) is shared by every path.TaskMeta.steered_event_ids: theSteerAck::Successarm attaches the consumed event id to the in-flight turn's meta.pool.send_steerreturns the turn'stokio::task::Id; the ack watcher carries it, andrecord_steered_eventmatches on the exact task id — a delayed ack for an ended turn can never bind to a successor turn on the same channel. Stale id → immediate 👀 removal instead.handle_prompt_result) or panic (recover_panicked_agent) — drains the ids and removes the 👀.handle_prompt_result, sotokio_mainnow drains all steered ids from every in-flightTaskMeta(drain_all_steered_event_ids) as the loop exits, and — after the grace period settles every prompt task — drops itssteer_ack_txand drainssteer_ack_rx(drain_crossing_steer_acks) to recover Success acks whose watchers resolved after the loop stopped polling. Both cleanups are bound-awaited before exit; a 2 s per-recv timeout backstops a wedged watcher.Testing
pool::tests: exact-id binding, false-when-gone, the generation-reuse race (turn A ends → turn B same channel → A's late ack must not bind to B), shutdown drain take-once semantics.steered_eyes_lifecycle_tests: wire-level assertions against a relay stub that answers the reaction/queryand captures the signed kind:5 (NIP-09) deletion posted to/events— normal completion, panic recovery, the exact shutdown composition, late-ack immediate cleanup, crossing-ack drain (Success kept, non-Success ignored), wedged-watcher timeout (paused time), and the empty no-op path.cargo test -p buzz-acp: 416 passed, 0 failed. Clippy--all-targetsclean, fmt clean.